Blog Archives

Check_iostat.pl version 0.9.7

Previous version and detailed information
https://sysengineers.wordpress.com/2010/02/05/check_iostat-for-nagios-version-0-9-5/

Application Goal:
Check disk IO using a simple Perl script and, where possible, the counters already available in the Linux /proc/ directory. The program must run under Nagios and return output in the syntax Nagios expects, so that Linux users can monitor their disk IO within Nagios. It should also report performance data in the Nagios-compatible syntax so that graphing is available in the Nagios / Cacti / Centreon add-ons.
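
As a rough illustration of what "Nagios-correct syntax" means here, the sketch below prints a status line, optionally followed by a pipe and performance data, and returns the exit code Nagios uses to derive the service state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The helper name nagios_report is hypothetical and is not part of check_iostat.pl; it only mirrors the shape of the output the script produces.

```shell
#!/bin/sh
# Hypothetical helper (not part of check_iostat.pl).
# Usage: nagios_report STATE MESSAGE [PERFDATA]
# Prints "STATE:MESSAGE" or "STATE:MESSAGE|PERFDATA" and returns the
# exit code that Nagios maps to that state.
nagios_report() {
    state=$1
    message=$2
    perfdata=${3:-}

    case "$state" in
        OK)       code=0 ;;
        WARNING)  code=1 ;;
        CRITICAL) code=2 ;;
        *)        state=UNKNOWN; code=3 ;;   # anything else is reported as UNKNOWN
    esac

    if [ -n "$perfdata" ]; then
        # Everything after the pipe is treated as performance data.
        printf '%s:%s|%s\n' "$state" "$message" "$perfdata"
    else
        printf '%s:%s\n' "$state" "$message"
    fi
    return "$code"
}
```

For example, nagios_report WARNING "sda=O:1,W:1,C:0; " "sda-util=75.20%; " prints one line and returns 1, which Nagios shows as a WARNING state.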

Changes
Fixed the missing division in the validated summed results
Added sprintf calls that round the validated numbers to 2 decimals
Fixed some declaration problems
Added a -debug switch to enable the debugging options
Coded around the Math::Complex functions that were used and removed the module
Improved the debugging output (readability)
Small fixes in the syntax

Installation
1. Copy the code below to your clipboard (Ctrl+C or the copy button inside the code field).
2. Log on to the Linux box with the Nagios or NRPE client.
3. Browse to the plugin dir, usually cd /usr/local/nagios/libexec/
4. Create a new file: vi ./check_iostat.pl
5. Press Insert and paste the code inside (when using PuTTY, paste with a right mouse click).
6. Save the file (Esc > : > wq > Enter, when using vi).
7. Give the file execute rights using chmod +x ./check_iostat.pl
8. Make nagios the owner: chown nagios:nagios ./check_iostat.pl
9. Test it using updatedb& ./check_iostat.pl -d sd -dbu -dbuw 70 -dbuc 88 -kbs -kbsw 10000 -kbsc 50000 -p
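
If you already have the script saved as a file, the copy, chmod and chown steps can be scripted. The following is only a sketch under a few assumptions: the function name install_check is made up for this example, the plugin directory is passed in rather than hard-coded, and the chown step runs only when an owner is given (it normally needs root).

```shell
#!/bin/sh
# Hypothetical helper. Usage: install_check SOURCE PLUGIN_DIR [OWNER]
# Copies SOURCE to PLUGIN_DIR/check_iostat.pl, makes it executable and,
# if OWNER is given (for example nagios:nagios), changes the owner.
install_check() {
    src=$1
    plugin_dir=$2
    owner=${3:-}
    dest="$plugin_dir/check_iostat.pl"

    cp "$src" "$dest" || return 1
    chmod +x "$dest" || return 1
    if [ -n "$owner" ]; then
        chown "$owner" "$dest" || return 1
    fi
    # Print the installed path so the caller can test it right away.
    printf '%s\n' "$dest"
}
```

Run as root, install_check ./check_iostat.pl /usr/local/nagios/libexec nagios:nagios would cover the file-handling steps in one go; the test in step 9 stays the same.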

Next, configure a Nagios command or use NRPE, and have fun 🙂
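
For the NRPE route, the client's nrpe.cfg takes lines of the form command[name]=command line. The small sketch below just prints such a line so it can be reviewed before it is added to the config; the function name nrpe_command and the command name check_iostat are assumptions for illustration, and the thresholds are the ones from the test in step 9.

```shell
#!/bin/sh
# Hypothetical helper. Usage: nrpe_command NAME PLUGIN_PATH [PLUGIN_ARGS...]
# Prints an nrpe.cfg command definition for the given plugin call.
nrpe_command() {
    name=$1
    shift
    # "$*" joins the plugin path and its arguments with single spaces.
    printf 'command[%s]=%s\n' "$name" "$*"
}

# Example call (prints one line for nrpe.cfg):
#   nrpe_command check_iostat /usr/local/nagios/libexec/check_iostat.pl \
#       -d sd -dbu -dbuw 70 -dbuc 88 -kbs -kbsw 10000 -kbsc 50000 -p
```

On the Nagios server side, a matching check_nrpe command would then call the name check_iostat; how that is defined depends on your own setup.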

#!/usr/bin/perl
# Written by Chris Gralike @ AMIS.
# Perl based check command to fetch and report the
# TPS (transactions per second) and IO wait times.
# Plugin uses iostat for operation.
# Version 0.9.7
#
# Changes post 0.9.7 >> 28-05-2010
# Line 298...303 Prevent division by zero, else exit because no data was collected  - Bug reported by Epiq.
# Line 380       Correction of a typo that prevented perf data of r/s w/s from being printed. - Bug reported by Epiq.
# Line 40        Added dm as possible device input used by the linux LVM. - Suggested by Epiq.
# Line 269       Extended the device if/regex-match validation to match more devices - Added by Jean Ventura.
#
###########################
use Switch;	# Not a core module since Perl 5.14; install Switch from CPAN on newer Perls.
use warnings;

my($numArgs,
   $debug,	# Print debugging information.
   $DevType,	# Used to match a certain devicetype from the resulting IOstat rows.
   $IOBIN,	# Is used to store a path to the iostat binary for execution.
   $Samples,	# Used to store the initial Samples returned by iostat.
   @SampleRows, # Used to store the rows generated by splitting the samples.
   $firstseen,  # Used to keep track of the found devices (iostat might return a set of devices, i.e. sda, sdb, sdc etc.)
   $Items,      # Used in the foreach to store the row being parsed.
   @cols,       # Used to store the columns in a row after a split.
   $dev,	# Used as a symbolic reference to dynamically create a var.
   $rqm,
   $val,
   $devtypes,
   $rws, $rws_warn, $rws_crit,
   $kbs, $kbs_warn, $kbs_crit,
   $awt, $awt_warn, $awt_crit,
   $svc, $svc_warn, $svc_crit,
   $devices, $itd,$v1,$v2,$v3,
   $v4,$v5,$v6,$v7,$v8,$v9,$v10,$v11,
   $dbu, $dbu_warn, $dbu_crit
   );

# Preparing to collect the dangerous user input.
# Here is a list of known device types. Please add any device you would like to monitor..
$devtypes=";sd;hd;dm;";
$numArgs = $#ARGV + 1;
$critical_global = 0;
$warning_global = 0;

if($numArgs > 0){
	for($i=0;$i<$numArgs;$i++){
	# Process our command line arguments and do some basic testing.
	# Could be made foolproof in the future.
	switch ($ARGV[$i]) {
		# Enable debugging.
		case '-debug'{
				$debug = 1;
			     }
		# Handle device type
		case '-d'    {
				$val=$ARGV[$i+1];
				# The value must be one of the known types listed in $devtypes.
				if( defined $val && index($devtypes, ";$val;") > -1 ){
					$DevType=$val;
					$i++;
				}else{
					print "Invalid disk type found. Typo?\n"; exit 1;
				}
			     }
		# Do we need to check rqm?
		case '-rqm'  { $rqm='1'; }
		# What is the warning threshold?
		case '-rqmw' {
				$val=$ARGV[$i+1];
				# Is the value numeric? (anchored, so partial matches are rejected)
				if(defined $val && $val=~m/^[0-9]+$/){
					$rqm_warn= int $val;
					$i++;
				}else{
				# Possible typo?
					print "Non-numeric value used in rqmw, typo?\n"; exit 1;
				}
			     }
		case '-rqmc' {
				$val=$ARGV[$i+1];
				# Is the value numeric?
				if(defined $val && $val=~m/^[0-9]+$/){
					$rqm_crit= int $val;
					$i++;
				}else{
				# Possible typo?
					print "Non-numeric value used in rqmc, typo?\n"; exit 1;
				}
			     }
		# Do we need to check rws?
		case '-rws'  { $rws='1'; }
		case '-rwsw' {
				$val=$ARGV[$i+1];
				# Is the value numeric?
				if(defined $val && $val=~m/^[0-9]+$/){
					$rws_warn= int $val;
					$i++;
				}else{
				# Possible typo?
					print "Non-numeric value used in rwsw, typo?\n"; exit 1;
				}
			     }
		case '-rwsc' {
				$val=$ARGV[$i+1];
				# Is the value numeric?
				if(defined $val && $val=~m/^[0-9]+$/){
					$rws_crit= int $val;
					$i++;
				}else{
				# Possible typo?
					print "Non-numeric value used in rwsc, typo?\n"; exit 1;
				}
			     }
		# Do we need to check kbs?
		case '-kbs'  { $kbs='1'; }
		case '-kbsw' {
				$val=$ARGV[$i+1];
				# Is the value numeric?
				if(defined $val && $val=~m/^[0-9]+$/){
					$kbs_warn= int $val;
					$i++;
				}else{
				# Possible typo?
					print "Non-numeric value used in kbsw, typo?\n"; exit 1;
				}
			     }
		case '-kbsc' {
				$val=$ARGV[$i+1];
				# Is the value numeric?
				if(defined $val && $val=~m/^[0-9]+$/){
					$kbs_crit= int $val;
					$i++;
				}else{
				# Possible typo?
					print "Non-numeric value used in kbsc, typo?\n"; exit 1;
				}
			     }
		# Do we need to check awt?
		case '-awt'  { $awt='1'; }
		case '-awtw' {
				$val=$ARGV[$i+1];
				# Is the value numeric?
				if(defined $val && $val=~m/^[0-9]+$/){
					$awt_warn= int $val;
					$i++;
				}else{
				# Possible typo?
					print "Non-numeric value used in awtw, typo?\n"; exit 1;
				}
			     }
		case '-awtc' {
				$val=$ARGV[$i+1];
				# Is the value numeric?
				if(defined $val && $val=~m/^[0-9]+$/){
					$awt_crit= int $val;
					$i++;
				}else{
				# Possible typo?
					print "Non-numeric value used in awtc, typo?\n"; exit 1;
				}
			     }
		# Do we need to check svc?
		case '-svc'  { $svc='1'; }
		case '-svcw' {
				$val=$ARGV[$i+1];
				# Is the value numeric?
				if(defined $val && $val=~m/^[0-9]+$/){
					$svc_warn= int $val;
					$i++;
				}else{
				# Possible typo?
					print "Non-numeric value used in svcw, typo?\n"; exit 1;
				}
			     }
		case '-svcc' {
				$val=$ARGV[$i+1];
				# Is the value numeric?
				if(defined $val && $val=~m/^[0-9]+$/){
					$svc_crit= int $val;
					$i++;
				}else{
				# Possible typo?
					print "Non-numeric value used in svcc, typo?\n"; exit 1;
				}
			     }
		# Do we need to check dbu?
		case '-dbu'  { $dbu='1'; }
		case '-dbuw' {
				$val=$ARGV[$i+1];
				# Is the value numeric?
				if(defined $val && $val=~m/^[0-9]+$/){
					$dbu_warn= int $val;
					$i++;
				}else{
				# Possible typo?
					print "Non-numeric value used in dbuw, typo?\n"; exit 1;
				}
			     }
		case '-dbuc' {
				$val=$ARGV[$i+1];
				# Is the value numeric?
				if(defined $val && $val=~m/^[0-9]+$/){
					$dbu_crit= int $val;
					$i++;
				}else{
				# Possible typo?
					print "Non-numeric value used in dbuc, typo?\n"; exit 1;
				}
			     }
		# Performance data. Might make the string human-unreadable... oh well, no loss there :)
		case '-p'    { $prf='1'; }
		# Print full messages per device overview.
		#case '-m'    { $fms='1'; }
		# Most used help switches.
		case '--help'{ USAGE(); }
		case '-h'    { USAGE(); }
	}
	}
	# Check if the basic requirements are met.
	if(!($DevType) ||  !(($rqm || $rws || $kbs || $awt || $svc || $dbu))){
		print "Minimal requirements (device and check type) are not met\n";
		USAGE();
	}
}else{
	# No input was given. Show the Usage();
	USAGE();
}

# Locate the iostat binary needed to fetch the stats.
chomp($IOBIN=`which iostat`);

if( !(-f $IOBIN) || !(-x $IOBIN)){
	print "A working iostat command is needed for this script to work \n";
	print "Also make sure the sysstat service is running! /etc/init.d/sysstat \n";
	exit 1;
}

if($DevType){
	# IF IOStat is found, lets collect some data.
	chomp($Samples=`$IOBIN -d -x -k 1 5 | grep $DevType`);
}else{
	print "Please select a valid devicetype \n";
	exit 1;
}

# Break the samples up in lines so we can evaluate them.
# The first set of samples are avg. values counted from boot.
# We need to discard them and collect the remaining samples.
@SampleRows=split(/\n/, $Samples);

# Firstseen is used to track the device names and to skip the first iteration of iostat, which contains
# avg stats counted from system boot time, stats we can't use here sadly :)
$firstseen='';
$CT=0;
$Devices=0;

# Print Debugging information
if($debug){
	    print "### Data collected ###\n";
	    print "dev|rrqm|wrqm|r/s|w/s|rKB/s|wKB/s|rq-sz|qu-sz|await|svctm|util%|\n";
}

foreach $Items (@SampleRows){
	# Break the row up into usable columns.
	@cols=split(/\s+/,$Items);
	# We only want a certain device type, so let's match what we know.
	# Disks usually have a prefix, for example (scsi = sd); it is set using
	# the DevType var.
         if($cols[0]=~m/$DevType-[0-9]$/ || $cols[0]=~m/$DevType[a-z]$/ || $cols[0]=~m/$DevType[a-z][0-9]$/){

		$dev=$cols[0];
		# Match the full name including its ';' delimiter, so that e.g. sda is not mistaken for sda1.
		if(index(";$firstseen", ";$dev;") > -1){
			#Declare new $$
			#my @$dev;
			# Store the collected data in the correct (dynamic) vars.
			$$dev[1]+=$cols[1];	# rrqm/s   Read Requests Merged per Second.
			$$dev[2]+=$cols[2];	# wrqm/s   Write Requests Merged per Second.
			$$dev[3]+=$cols[3];	# r/s      Number of read requests issued per Second
			$$dev[4]+=$cols[4];	# w/s      Number of write requests issued per Second
			$$dev[5]+=$cols[5];	# rKB/s    Number of Kilobytes read per Second.
			$$dev[6]+=$cols[6];	# wKB/s    Number of Kilobytes written per Second.
			$$dev[7]+=$cols[7];	# avgrq-sz Average size (in sectors) of the issued requests.
			$$dev[8]+=$cols[8];	# avgqu-sz Average queue length of the requests issued.
			$$dev[9]+=$cols[9];	# await    Average wait time in ms for IO requests to be served.
			$$dev[10]+=$cols[10];	# svctm    Average service time in ms for IO requests that were issued.
			$$dev[11]+=$cols[11];	# %util    Percentage of CPU time during IO requests (bandwidth util), saturation at 90~100%
			$CT++; # Add a new iteration to the count
			# Print some debugging vars if requested, to show the data is collected.
			if($debug){
				    print "$dev|$cols[1]|$cols[2]|$cols[3]|$cols[4]|$cols[5]|$cols[6]|$cols[7]|$cols[8]|$cols[9]|$cols[10]|$cols[11]|\n";
			}
		}else{
			$Devices++;
			$firstseen.="$dev;";
		}
	}
}

# Prevent $itd (iterations / disk) from becoming zero and exit when no devices are found.
# on line 299.
if($Devices > 0){
	$itd = ($CT / $Devices);
}else{
	print "No performance data was captured. Please check if the device name is correct\n"; exit 1;
}

# Print debugging information
if($debug){
	print "###Devices Counted###\n";
	print "Number of devices : $Devices\n";
	print "Number of iterations per device : $itd\n";
	print "Total number of iterations : $CT\n";
}
# Lets collect the device information from the firstseen var
# and start processing it for some perf check/data
# Lets also recycle some previously used vars for this.
@cols=split(/;/,$firstseen);
foreach $Items (@cols){
	# Items now contains the devicenames needed to access the data again.
	# We now need to check them against some basic thresholds
	# Print a nice table with the calculated values when we are in debug.

  	# Print debugging information
	if($debug){
		   print "### Counted Values ###\n";
		   print "$Items|$$Items[1]|$$Items[2]|$$Items[3]|$$Items[4]|$$Items[5]|$$Items[6]|$$Items[7]|$$Items[8]|$$Items[9]|$$Items[10]|$$Items[11]|\n";
        }

	#What do we want to check against a threshold?
	#First check the selection if any.
	if($rqm || $rws || $kbs || $awt || $svc || $dbu){

		#Set the counts to zero
		$critical_state='0';
		$warning_state='0';
		$ok_state='0';
		# Divide the summed values by the iterations per device and round to 2 decimals.
		$round="%.2f";
		$v1 = sprintf($round, ($$Items[1] / $itd));
		$v2 = sprintf($round, ($$Items[2] / $itd));
		$v3 = sprintf($round, ($$Items[3] / $itd));
		$v4 = sprintf($round, ($$Items[4] / $itd));
		$v5 = sprintf($round, ($$Items[5] / $itd));
		$v6 = sprintf($round, ($$Items[6] / $itd));
		$v7 = sprintf($round, ($$Items[7] / $itd));
		$v8 = sprintf($round, ($$Items[8] / $itd));
		$v9 = sprintf($round, ($$Items[9] / $itd));
		$v10 = sprintf($round, ($$Items[10] / $itd));
		$v11 = sprintf($round, ($$Items[11] / $itd));

		# Requests Merged per second.
		if($rqm){
			# Critical
			if(($v1 >= $rqm_crit) || ($v2 >= $rqm_crit)){
				$critical_state+='1';
			# Warning?
			}elsif(($v1 >= $rqm_warn) || ($v2 >= $rqm_warn)){
				$warning_state+='1';
			# Ok
			}else{
				$ok_state+='1';
			}
			# Add the counters to the performance vars
			$perf.="$Items-rrqm/s=$v1; $Items-wrqm/s=$v2;";
		}
		# Reads / Writes per second.
		if($rws){
			if(($v3 >= $rws_crit) || ($v4 >= $rws_crit)){
	                    	$critical_state+='1';
             		# Warning?
               	 	}elsif(($v3 >= $rws_warn) || ($v4 >= $rws_warn)){
                        	$warning_state+='1';
                	# Ok
                	}else{
                        	$ok_state+='1';
                	}
			# Add the counters to the performance var.
			$perf.="$Items-r/s=$v3; $Items-w/s=$v4; ";
		}
		# KB Read/Writes per second.
		if($kbs){
			if(($v5 >= $kbs_crit) || ($v6 >= $kbs_crit)){
	                        $critical_state+='1';
	                # Warning?
       	         	}elsif(($v5 >= $kbs_warn) || ($v6 >= $kbs_warn)){
                        	$warning_state+='1';
                	# Ok
                	}else{
                       	 	$ok_state+='1';
                	}
			$perf.="$Items-rKB/s=$v5; $Items-wKB/s=$v6; ";
		}
		# Average wait time
		if($awt){
			if(($v9 >= $awt_crit)){
                        	$critical_state+='1';
                	# Warning?
                	}elsif(($v9 >= $awt_warn)){
                        	$warning_state+='1';
                	# Ok
                	}else{
                        	$ok_state+='1';
                	}
			$perf.="$Items-await=$v9; ";
		}
		# Average service time
		if($svc){
			if($v10 >= $svc_crit){
       	                	$critical_state+='1';
                	# Warning?
                	}elsif($v10 >= $svc_warn){
                        	$warning_state+='1';
                	# Ok
                	}else{
                        	$ok_state+='1';
                	}
			$perf.="$Items-svctm=$v10; "
		}
		# Disk bandwidth Utilization
		if($dbu){
			if($v11 >= $dbu_crit){
                	        $critical_state+='1';
              	  	# Warning?
               	 	}elsif($v11 >= $dbu_warn){
                        	$warning_state+='1';
                	# Ok
                	}else{
                        	$ok_state+='1';
                	}
			$perf.="$Items-util=$v11%; ";
		}
	}else{
		print "At least select a value to measure..\n";
		exit 1;
	}
	# Print debugging information about the validated values.
	if($debug){ print "### Validated Divisions ###\n";
                    print "$Items|$v1|$v2|$v3|$v4|$v5|$v6|$v7|$v8|$v9|$v10|$v11|\n";
	}

	# Create a messages var.
	$mgs.="$Items=O:$ok_state,W:$warning_state,C:$critical_state; ";

	# Track the global state (1 crit 1 warn); compare numerically, not as strings.
	if(($critical_state > 0) || ($critical_global > 0)){
		$critical_global+='1';
	}
	if(($warning_state > 0) || ($warning_global > 0)){
		$warning_global+='1';
	}
}

# Compose a nice nagios output.
if($critical_global >= '1'){
	print "CRITICAL:";
	$exit=2;
}elsif($warning_global >= '1'){
	print "WARNING:";
	$exit=1;
}else{
	print "OK:";
	$exit=0;
}
# Print the remainder, the most important data was processed.
print $mgs; if($prf){ print "|$perf"; } print "\n";
exit $exit;

###Subroutines
sub USAGE{
	print "
                Usage : $0 -d [Dev] [options]

		-p Print performance data about the measured samples.

		-d {grep string used on IOstat}
		examples;
                 sd     #All scsi devices.
                 hd     #All IDE devices.
		 sda	#Only device sda

                [Available Measurement Options]
                -rqm -rqmw val -rqmc val        # read/write merged             [#]
                -rws -rwsw val -rwsc val        # read/write per second.        [s]
                -kbs -kbsw val -kbsc val        # KBs read/written per second.  [s]
                -awt -awtw val -awtc val        # Average IO wait time.         [ms]
                -svc -svcw val -svcc val        # Average IO service time.      [ms]
                -dbu -dbuw val -dbuc val        # Disk utilization              [%]\n";
        exit 1;
}
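
The averaging the script performs can be sketched in a few lines of shell and awk. This is a simplified sketch, not a replacement for the plugin: the function name avg_util is hypothetical, it only looks at %util, and it assumes %util is the last column of each iostat -x device row. Like the script, it drops the first row seen per device (the averages since boot) and averages the rows that follow.

```shell
#!/bin/sh
# Hypothetical helper. Usage: iostat -d -x -k 1 5 | avg_util sd
# Reads iostat -x device rows on stdin and prints the average %util per
# device whose name starts with the given prefix.
avg_util() {
    awk -v prefix="$1" '
        index($1, prefix) == 1 {
            # The first row per device holds the since-boot averages: skip it.
            if (!($1 in seen)) { seen[$1] = 1; next }
            sum[$1] += $NF      # assumption: %util is the last column
            n[$1]++
        }
        END {
            for (d in sum) printf "%s-util=%.2f%%\n", d, sum[d] / n[d]
        }
    ' | sort
}
```

The sort at the end only makes the output order predictable, since awk does not guarantee the order in which it walks an array.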


check_iostat / check_io for nagios Version 0.9.5

OK, here is the first working version of check_iostat.pl

NEW Version available HERE

https://sysengineers.wordpress.com/2010/05/27/check_iostat-pl-version-0-9-7

Why this version?
Well, to my big surprise I was not able to find any satisfying check plugin for disk IO that didn't need all kinds of additional software or undocumented Perl plugins. What I really wanted was a simple check plugin that would only require the default sysstat daemon already present on a lot of our systems (an Oracle DB requirement).

Because no one seemed to have written such a plug-in I decided to write one myself. And here is the result.

How to get started?
Copy paste the source into a check_iostat.pl file within nagios/libexec.

The following needs to be on your Linux machine.
sysstat must be installed!
iostat must be available!
perl (base) should be installed!

What can you do?
1. Let us know on which distro you have tested this code, and/or what changes you made to make it work.
2. Tell me if you spotted any improvements in the code.
3. Tell me if there are any bugs.

What more?
Questions? Different requirements? Improvements? Code cleanup?
Please let me know…

Tested with?
PERL
v5.8.8
v5.8.3

LINUX
Enterprise Linux Enterprise Linux Server release 5.3 (Carthage)
Enterprise Linux Enterprise Linux Server release 5.2 (Carthage)
SUSE LINUX Enterprise Server 9 (i586)

Note!
The code is still a bit messy but functional. Keep an eye out here for updates. I will be cleaning up this code further.

Rgrds,
Chris.

Code below…

Timekeeping in VMware… o my…

If there is a subject that has many, and I really mean many, posts, and with those posts many, many readers, it's timekeeping in VMware, especially when your guest OS is on the Linux platform. There are also many suggestions on how to solve this problem. To give you a quick glance at what's happening out there, here are some of the suggestions you might encounter.

1. Cron the ntpd refresh command (put the NTP renew in a task and execute it every second).
    (Not really an option with 100+ servers: loads and loads of network traffic.)
2. Recompile the kernel using the 100 Hz frequency setting instead of the 1000 or 250 Hz setting.
    (One I want to test before discarding it; he might have a point there.)
3. Patch the kernel / NTPD using the latest versions.
     (Should be a standard job and best practice, not a suggestion!)
4. Use a VMware-compatible compiled RPM to reinstall the kernel.
     (Sounds much like option 2, which I want to test first; I'll go for the manual compile 🙂 )
5. I don't even want to mention all the other options.
     (Too silly, but fun reading 🙂 )

With all respect to the people who searched for and found the solutions stated above: there was indeed a time when these solutions were the best to apply. But time has moved on. VMware introduced solutions using the VMware Tools (almost the same as the cron solution), and the communities responded, committed to solving these problems for their most valued distro. The result is a kernel setting that is available for various kernels, and these settings can be found on the VMware site. Even though I committed myself to testing these various options before implementing one or the other, the bootloader option looks the safest to suggest to a broad audience. So here it is.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1006427
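
Before and after changing the bootloader options, it can help to see what the guest is currently running with. The sketch below is a hypothetical helper (the name show_clock_setup is made up here): it prints the kernel command line from /proc/cmdline and, on kernels that expose it, the active clocksource from sysfs. The optional argument is a root prefix that exists only so the function can be tried against a test directory; which parameters to set for your kernel is something to take from the KB article, not from this sketch.

```shell
#!/bin/sh
# Hypothetical helper. Usage: show_clock_setup [ROOT]
# Prints the kernel command line and the current clocksource, if readable.
show_clock_setup() {
    root=${1:-}
    cmdline="$root/proc/cmdline"
    clocksource="$root/sys/devices/system/clocksource/clocksource0/current_clocksource"

    if [ -r "$cmdline" ]; then
        printf 'cmdline: %s\n' "$(cat "$cmdline")"
    fi
    # Older kernels may not have this sysfs entry; then nothing is printed.
    if [ -r "$clocksource" ]; then
        printf 'clocksource: %s\n' "$(cat "$clocksource")"
    fi
}
```

Calling show_clock_setup without arguments reads the live system; comparing the output before and after a reboot shows whether the new boot options were picked up.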

Oh, there are always people to thank 🙂
• My uncle Marco Gralike, for paying way more attention than me 🙂
• Prutser, for breaking open the kernel discussion; good article there.
http://prutser.wordpress.com/2009/02/08/why-does-my-linux-virtual-machine-lose-time/
• VMware, for maintaining their KB so well 🙂
• You, for taking the time to read this nonsense 🙂

Memo : Windows NTP configuration

I'll be short about it: time is important!

The old way….

Net time /setsntp:0.europe.pool.ntp.org

The advised way….
Register the time service.

w32tm /register

Configure it to sync with an external ntp server.

w32tm /config /update /manualpeerlist:"0.europe.pool.ntp.org 1.europe.pool.ntp.org 2.europe.pool.ntp.org 3.europe.pool.ntp.org" /syncfromflags:MANUAL /reliable:YES

View the current stats

w32tm /monitor

A little warning that I feel should be made.
When you are setting the clock back in time, the service might need some time to slowly correct it. This is because conflicts might otherwise arise with time-dependent services and the like. Keep an eye on your Windows logs and use the /monitor switch to follow the NTP service.

Check which NTP pool to use for your own country at this location: http://www.pool.ntp.org. Also, the listed pools mainly consist of stratum 2 public servers. This should be accurate enough for your local network ^^.

Make sure that the NTP service can be reached, and make sure DNS is available. Otherwise, resolve the pool addresses yourself (they may change over time).
-Rgrds,