= Nagios =
Nagios is a powerful monitoring system that enables organizations to identify and resolve IT infrastructure problems before they affect critical business processes.

== Check nagios configuration ==
{{{#!highlight sh
# validate the main configuration file
/usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
# status service on CentOS
service nagios status
# restart service on CentOS
service nagios restart
}}}

== Install nagios client NRPE on CentOS 5.X ==
{{{#!highlight sh
cd /tmp
wget dl.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
rpm -ivh epel-release-5-4.noarch.rpm
yum install -y nrpe nagios-plugins-all openssl
cp /etc/nagios/nrpe.cfg /etc/nagios/nrpe.cfg.back
# Edit /etc/nagios/nrpe.cfg and add to allowed_hosts the IP address of the Nagios server
service nrpe status
service nrpe start
chkconfig nrpe on
cat /etc/xinetd.d/nrpe # check if exists to see where the config is
}}}

== Install nagios client NRPE on CentOS 6.X ==
{{{#!highlight sh
cd /tmp
wget dl.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
rpm -ivh epel-release-6-8.noarch.rpm
yum install -y nrpe nagios-plugins-all openssl
cp /etc/nagios/nrpe.cfg /etc/nagios/nrpe.cfg.back
# Edit /etc/nagios/nrpe.cfg and add to allowed_hosts the IP address of the Nagios server
service nrpe status
service nrpe start
chkconfig nrpe on
cat /etc/xinetd.d/nrpe # check if exists to see where the config is
}}}

== Sample service config ==
{{{#!highlight bash
define service{
    use                     generic-service
    service_description     Secure shell
    check_command           check_ssh
}

define service{
    use                     generic-service
    service_description     Web server
    check_command           check_http
}

define service{
    use                     generic-service
    service_description     RDP Windows
    check_command           check_tcp!3389
}
}}}

== Check command ==
{{{#!highlight bash
/usr/local/nagios/libexec/check_nrpe -H 192.168.1.1 -c check_metric
}}}

== Check http with hostname, url and content ==
Plugin description: https://www.nagios-plugins.org/doc/man/check_http.html

Search for commands.cfg and define a
new command to handle the hostname, the string to search for and the URL.
{{{
define command{
    command_name    check_http_content
    command_line    $USER1$/check_http -w 5 -c 10 -H $ARG1$ -s "$ARG2$" -u $ARG3$
}

define service{
    use                     generic-service
    host_name               SERVERXYZ
    service_description     HTTP www.example.net/urlx/
    check_command           check_http_content!www.example.net!strInHtml!/urlx/
}
}}}
 * check_http -w 5 -c 10 -H hostname -s "string" -u /urlx/
{{{
./check_http -w 5 -c 10 -H www.sapo.pt -s "teste" -u /
HTTP OK: HTTP/1.1 200 OK - 391565 bytes in 0.376 second response time |time=0.376469s;5.000000;10.000000;0.000000 size=391565B;;;0

./check_http -w 5 -c 10 -H www.sapo.pt -s "testesss" -u /
HTTP CRITICAL: HTTP/1.1 200 OK - string not found - 391720 bytes in 0.386 second response time |time=0.385639s;5.000000;10.000000;0.000000 size=391720B;;;0
}}}

== Sample /etc/nagios/nrpe.cfg config file on CentOS 6.3 64 bit, host checked by Nagios ==
{{{
log_facility=daemon
pid_file=/var/run/nrpe/nrpe.pid
server_port=5666
nrpe_user=nrpe
nrpe_group=nrpe
#nagios server 192.168.1.2
allowed_hosts=192.168.1.2,127.0.0.1
dont_blame_nrpe=0
debug=1
command_timeout=60
connection_timeout=300
include_dir=/etc/nrpe.d/
command[check_users]=/usr/lib64/nagios/plugins/check_users -w 5 -c 10
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_hda1]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /dev/hda1
command[check_zombie_procs]=/usr/lib64/nagios/plugins/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/lib64/nagios/plugins/check_procs -w 150 -c 200
}}}

== check_oracle_health shows warning when OK ==
[[http://forums.meulie.net/viewtopic.php?f=62&t=6282]]

== Nagios shows Warning when it is OK ==
Run the command locally on the target server as the user nrpe. The nrpe user must be allowed to log in for this to work.
{{{#!highlight sh
su nrpe
mycommand.py
echo $? # exit code of the plugin: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN
}}}
If there are permission problems or similar when running as the user nrpe, solve them by testing the commands as that user.
If the command creates or uses files, delete any that were created while testing as root, so that they are recreated with the right permissions.

== NRPE: Unable to read output (CentOS 6.3) ==
Make sure the following is set:
{{{#!highlight sh
setenforce 0
nano /etc/sysconfig/selinux # change to SELINUX=disabled
}}}

== Change number processes on check_total_procs command ==
If we are getting too many warning messages regarding the total number of processes, do the following:
 * ps uax | wc -l # count the number of active processes on the machine, nrProcs
 * Edit the file /etc/nagios/nrpe.cfg
 * Change command[check_total_procs], setting the warning -w to nrProcs+20 and the critical -c to nrProcs+20+30
 * Save the file
 * service nrpe restart

== Sample plugin in Python ==
=== Deploy plugin in Ubuntu NRPE client ===
 * Ubuntu plugin location: /usr/lib/nagios/plugins/acme.py
{{{#!highlight python
#!/usr/bin/python
import sys
import datetime

# Today's log file, named YYYYMMDD_x.log
nowx = datetime.datetime.now()
filename = '/home/userx/app/%02d%02d%02d_x.log' % (nowx.year, nowx.month, nowx.day)

# Count how often each acme name appears in the 'ACME did' lines
acmes = {}
filex = open(filename)
for line in filex:
    if 'ACME did' in line:
        splitted = line.split(' ')
        acmeName = splitted[2]
        acmes[acmeName] = acmes.get(acmeName, 0) + 1
filex.close()

# Status text first, performance data after the '|'
print('OK - Nr acme: %d|nrAcme=%d;;;;' % (len(acmes), len(acmes)))
res = 0  # OK-0 WARNING-1 CRITICAL-2 UNKNOWN-3
sys.exit(res)
}}}
 * Config NRPE: /etc/nagios/nrpe.cfg
 * command[check_acme]=/usr/lib/nagios/plugins/acme.py
 * service --status-all
 * service nagios-nrpe-server restart

=== Nagios plugin configuration on nagios server ===
 * check_nrpe -H 192.168.1.123 -c check_acme # check plugin
 * define the service
 * nagios -v nagios.cfg # check config
 * service nagios restart

Service definition:
{{{
define service{
    use                     generic-service
    host_name               ACMESRV
    service_description     Acmes
    check_command           check_nrpe!check_acme
    contacts                abc.xyz.responsible
}
}}}

=== Edit map file for Nagiosgraph ===
{{{
# Number Acme
/perfdata:.*nrAcme=([\d]+);;;;/ and
push @s, ['Number Acme', ['NrAcme', GAUGE, $1 ] ];
}}}

== Check TCP port plugin ==
{{{#!highlight python
#!/usr/bin/env python
import socket
import sys
import time

host = sys.argv[1]
port = int(sys.argv[2])
retValue = 0  # OK
timeout = 5
delta = timeout
start = time.time()
try:
    # Open and close a TCP connection, measuring how long it takes
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    s.connect((host, port))
    s.close()
    delta = time.time() - start
except Exception:
    retValue = 2  # CRITICAL
    delta = timeout
# Status text first, performance data after the '|' (whole seconds)
if retValue == 0:
    print('OK - Connection time:%d|connTime=%d;;;;' % (delta, delta))
if retValue == 2:
    print('ERROR - Connection time:%d|connTime=%d;;;;' % (delta, delta))
sys.exit(retValue)
}}}
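
The plugin above reports the connection time in whole seconds. The following is a minimal sketch of a variant that keeps sub-second resolution and separates the measurement from the output text, so it can be tried without a Nagios server. It assumes Python 3; the names check_tcp, format_result and main are hypothetical helpers for illustration, not part of Nagios or NRPE.

```python
#!/usr/bin/env python3
import socket
import sys
import time

OK, CRITICAL, UNKNOWN = 0, 2, 3  # Nagios exit codes used in this sketch


def check_tcp(host, port, timeout=5.0):
    """Return (exit_code, seconds) for a single TCP connection attempt."""
    start = time.monotonic()
    try:
        # create_connection resolves the host and applies the timeout
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError:
        # refused, unreachable, timed out or unresolvable host
        return CRITICAL, float(timeout)
    return OK, time.monotonic() - start


def format_result(code, seconds):
    """Build the plugin output: status text, then perfdata after the '|'."""
    label = 'OK' if code == OK else 'CRITICAL'
    return '%s - Connection time:%.3f|connTime=%.3fs;;;;' % (label, seconds, seconds)


def main(argv):
    """Hypothetical entry point: argv is [script, host, port]."""
    if len(argv) != 3:
        print('UNKNOWN - usage: %s HOST PORT' % argv[0])
        return UNKNOWN
    code, seconds = check_tcp(argv[1], int(argv[2]))
    print(format_result(code, seconds))
    return code
```

To use it as a plugin, add `sys.exit(main(sys.argv))` as the last line of the script, so the return value becomes the exit code that NRPE passes back to Nagios. The perfdata label is still connTime, but the value now carries a unit and decimals, so any Nagiosgraph map rule written for the integer form would need adjusting.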