Friday 24 May 2013

creating a check for a device with SNMP and Nagios

I'm assuming your monitoring software is based on Nagios.

The first step is to see whether a check already exists in the monitoring system.

If not, check http://exchange.nagios.org/. Download the script and read it. Understand it and test it.

If you can't find one, you have two options. Create one from scratch or use check_snmp.

To use check_snmp you need to know the correct OIDs. Contact the vendor of the device or check the documentation; sometimes they have all of this in one document. Otherwise use snmpwalk:
snmpwalk -v2c -c communityname 192.168.1.10
You'll have to go through all the OIDs and find the value you are interested in monitoring. You can then set up check_snmp with that OID.

snmpwalk -v2c -c communityname 192.168.1.10 1.3.6.1.4.1.20632.5.14
SNMPv2-SMI::enterprises.20632.5.14 = STRING: "42.0 degrees C"

That OID "1.3.6.1.4.1.20632.5.14" is for CPU temp.
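
Going through a full walk by eye gets tedious. Here is a rough Python sketch, not part of any Nagios tooling, that splits snmpwalk output into (OID, type, value) rows and searches the values. The function names are my own, and it assumes the default "OID = TYPE: value" output format shown above.

```python
import re

# Matches lines such as:
#   SNMPv2-SMI::enterprises.20632.5.14 = STRING: "42.0 degrees C"
LINE = re.compile(r'^(?P<oid>\S+) = (?P<type>[A-Za-z0-9-]+): (?P<value>.*)$')

def parse_walk(output):
    """Turn snmpwalk text into a list of (oid, type, value) tuples."""
    rows = []
    for line in output.splitlines():
        m = LINE.match(line.strip())
        if m:
            rows.append((m["oid"], m["type"], m["value"].strip('"')))
    return rows

def find_values(rows, needle):
    """Return the rows whose value contains needle (case-insensitive)."""
    return [row for row in rows if needle.lower() in row[2].lower()]
```

For example, find_values(parse_walk(text), "degrees") narrows a long walk down to the temperature readings.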

Let's try it with check_snmp:
./check_snmp -H 10.7.11.219 -C cudaSNMP -o 1.3.6.1.4.1.20632.5.14
SNMP OK - "43.0 degrees C" |

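With no pattern or thresholds, this check returns OK whenever the device answers the query, whatever the value. That's fine for confirming the OID responds, but it's no good for a value that changes. You can use the -r switch to match the returned string against a regular expression.

With the pattern below, the check is OK as long as the temperature is between 40 and 49 degrees: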
./check_snmp -H 10.7.11.219 -C cudaSNMP -o 1.3.6.1.4.1.20632.5.14 -r "4[0123456789]"
SNMP OK - "42.0 degrees C" |

If I change the pattern to match 50 to 59, it alerts, because 42 no longer matches:
./check_snmp -H 10.7.11.219 -C cudaSNMP -o 1.3.6.1.4.1.20632.5.14 -r "5[0123456789]"
SNMP CRITICAL - *"42.0 degrees C"* |
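
The -r behaviour can be sketched in a few lines of Python. This is only an illustration of the matching, not how check_snmp is implemented; the function name is made up, and it assumes (as the two outputs above suggest) that a value matching the pattern is OK and anything else is CRITICAL.

```python
import re

def regex_state(value, pattern):
    """Return "OK" if pattern is found anywhere in value, else "CRITICAL"."""
    return "OK" if re.search(pattern, value) else "CRITICAL"

def demo():
    """Print the state for a few sample readings against the 40-49 pattern."""
    for reading in ("42.0 degrees C", "39.0 degrees C", "50.0 degrees C"):
        print(reading, "->", regex_state(reading, "4[0123456789]"))
```

Note the pattern is not anchored, so "4[0123456789]" would also match a reading such as "140.0 degrees C". It also goes CRITICAL below 40, not just above 49.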

If you want to write a check from scratch, it's a good idea to look at some checks already on http://exchange.nagios.org/. You'll need to collect all the OIDs you need and figure out what each value means. This can be a lot of work: forcing certain situations (unplugging cables etc.) and checking the values returned. This is why most people just use check_snmp with a simple ok / not ok check.
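
To give an idea of what a from-scratch check involves, here is a minimal sketch of the decision logic in Python. It assumes the value arrives as a string like "42.0 degrees C"; the function name and the warn/crit numbers are my own placeholders. The exit codes 0 to 3 are the standard Nagios plugin states.

```python
import re

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes

def evaluate(raw, warn, crit):
    """Map a reading such as '42.0 degrees C' to (exit_code, status_line)."""
    match = re.search(r"-?\d+(?:\.\d+)?", raw)
    if match is None:
        return UNKNOWN, "TEMP UNKNOWN - could not parse %r" % raw
    temp = float(match.group())
    perfdata = "temp=%s;%s;%s" % (temp, warn, crit)  # label=value;warn;crit
    if temp >= crit:
        return CRITICAL, "TEMP CRITICAL - %s C | %s" % (temp, perfdata)
    if temp >= warn:
        return WARNING, "TEMP WARNING - %s C | %s" % (temp, perfdata)
    return OK, "TEMP OK - %s C | %s" % (temp, perfdata)
```

A real plugin would fetch the raw value over SNMP, print the status line and exit with the code (for example with sys.exit).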


command for finding interfaces which have not been used on cisco switches


Only available on 4500's with supervisor
# SHOW INTERFACE LINK

# show int | i proto|Last in

# show int | i proto.*notconnect|proto.*administratively down|Last in.* [6-9]w|Last in.*[0-9][0-9]w|[0-9]y|disabled|Last input never, output never, output hang never

This last command filters out the text you don't need.
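
If you have saved the "show int" output to a file, the same idea can be sketched offline in Python. This is a rough sketch under assumptions: the function name is mine, it only looks at the "Last input" value (never, 6 weeks or more, or any number of years), and it expects each interface block to start with a "line protocol" status line.

```python
import re

# "Last input never", "Last input 7w2d", "Last input 1y3w" and so on
STALE = re.compile(r"Last input (never|\d+y\S*|(?:[6-9]|\d{2,})w\S*),")

def unused_interfaces(show_int):
    """Return interface names whose last input is never or 6+ weeks ago."""
    unused = []
    current = None
    for line in show_int.splitlines():
        if "line protocol" in line:
            current = line.split()[0]   # first word is the interface name
        elif current and STALE.search(line):
            unused.append(current)
    return unused
```

Unlike the CLI filter, this ignores whether the port is notconnect or administratively down, so treat the result as a list of candidates to check.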

Investigating high CPU usage on cisco switches

show processes cpu sorted | excl 0.00%  0.00%  0.00%

This command shows the processes using the most CPU. If one is over 5% then there is a problem. Google the process name to see what it does and take it from there. View the graphs in your monitoring system to narrow down when it started. Check the logs from the switch.
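
If you collect this output regularly, the 5% rule of thumb can be applied with a small script. The sketch below is an assumption-heavy illustration: it expects the usual column layout (PID, Runtime, Invoked, uSecs, 5Sec, 1Min, 5Min, TTY, Process), the function name is mine, and the threshold is just the 5% mentioned above.

```python
import re

# One process row: four integer columns, three percentages, TTY, then the name
ROW = re.compile(
    r"^\s*\d+\s+\d+\s+\d+\s+\d+\s+([\d.]+)%\s+([\d.]+)%\s+([\d.]+)%\s+\d+\s+(.+?)\s*$"
)

def busy_processes(output, threshold=5.0):
    """Return (process_name, five_second_percent) for rows above threshold."""
    busy = []
    for line in output.splitlines():
        m = ROW.match(line)
        if m and float(m.group(1)) > threshold:
            busy.append((m.group(4), float(m.group(1))))
    return busy
```

Header lines and the "CPU utilization" summary do not match the row pattern, so they are skipped.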

9200 16.x TS steps
https://www.cisco.com/c/en/us/support/docs/ios-nx-os-software/ios-xe-16/213549-troubleshoot-high-cpu-usage-in-catalyst.html

  • IOSd
  • LSMPI
  • FED 
  • Doppler ASIC
  • Physical interface

IOSd: This is the Cisco IOS® daemon that runs on the Linux kernel. It is run as a software process within the kernel

LSMPI: Linux Shared Memory Punt Interface

Forwarding Engine Driver (FED): This is the heart of the Cisco Catalyst switch and is responsible for all hardware programming/forwarding


  • Packet Delivery System (PDS): This is the architecture and process of how packets are delivered to and from the various subsystems. As an example, it controls how packets are delivered from the FED to the IOSd and vice versa
  • Control Plane (CP): The control plane is a generic term used to group together the functions and traffic that involve the CPU of the Catalyst Switch. This includes traffic such as Spanning Tree Protocol (STP), Hot Standby Router Protocol (HSRP), and routing protocols that are destined to the switch, or sent from the switch. This also includes application layer protocols like Secure Shell (SSH), and Simple Network Management Protocol (SNMP) that must be handled by the CPU
  • Data Plane (DP): Typically the data plane encompasses the hardware ASICs and traffic that is forwarded without assistance from the Control Plane
  • Punt: An ingress protocol control packet that is intercepted by the DP and sent to the CP for processing
  • Inject: CP generated protocol packet sent to DP to egress out on IO interface(s)

show processes cpu sorted 5min | e 0.00%  0.00%  0.00%
Look for the highest execution time.

show platform hardware fed switch active qos queue stats internal cpu policer

show platform software fed switch active punt cause summary
show platform software fed switch active punt cause clear
show platform software fed switch active punt cause summary

show platform software fed switch active punt cpuq rates | e 0        0        0        0        0        0

show platform software fed switch active punt rates interfaces

show platform software fed switch active punt rates interfaces 0x000001d2

show platform software fed switch active punt rates interfaces 0x000001d2 | e 0        0        0        0

show monitor capture cpuCap buffer brief

show monitor capture cpuCap buffer detailed

Packet captures on switch
https://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst3850/software/release/16-3/configuration_guide/b_163_consolidated_3850_cg/b_163_consolidated_3850_cg_chapter_01001011.html

show processes cpu history
*'s show spikes, #'s are used for average


Script for Intermittent High CPU
From https://www.cisco.com/c/en/us/support/docs/ios-nx-os-software/ios-xe-16/213549-troubleshoot-high-cpu-usage-in-catalyst.html#anc28

In the event that the high CPU on the switch is intermittent, it is possible to set up a script on the switch to automatically run these commands at the time of high CPU events. The entry-val is used to determine how high the CPU is before the script triggers. The script monitors the 5 second CPU average SNMP OID. Two files are written to the flash, tac-cpu-<timestamp>.txt contains the command outputs, and tac-cpu-<timestamp>.pcap contains the CPU ingress capture. These files can then be reviewed at a later date.

config t
no event manager applet high-cpu authorization bypass
event manager applet high-cpu authorization bypass
event snmp oid 1.3.6.1.4.1.9.9.109.1.1.1.1.3.1 get-type next entry-op gt entry-val 80 poll-interval 1 ratelimit 300 maxrun 180
action 0.01 syslog msg "High CPU detected, gathering system information."
action 0.02 cli command "enable"
action 0.03 cli command "term exec prompt timestamp"
action 0.04 cli command "term length 0"
action 0.05 cli command "show clock"
action 0.06 regex "([0-9]|[0-9][0-9]):([0-9]|[0-9][0-9]):([0-9]|[0-9][0-9])" $_cli_result match match1
action 0.07 string replace "$match" 2 2 "."
action 0.08 string replace "$_string_result" 5 5 "."
action 0.09 set time $_string_result
action 1.01 cli command "show proc cpu sort | append flash:tac-cpu-$time.txt"
action 1.02 cli command "show proc cpu hist | append flash:tac-cpu-$time.txt"
action 1.03 cli command "show proc cpu platform sorted | append flash:tac-cpu-$time.txt"
action 1.04 cli command "show interface | append flash:tac-cpu-$time.txt"
action 1.05 cli command "show interface stats | append flash:tac-cpu-$time.txt"
action 1.06 cli command "show log | append flash:tac-cpu-$time.txt"
action 1.07 cli command "show ip traffic | append flash:tac-cpu-$time.txt"
action 1.08 cli command "show users | append flash:tac-cpu-$time.txt"
action 1.09 cli command "show platform software fed switch active punt cause summary | append flash:tac-cpu-$time.txt"
action 1.10 cli command "show platform software fed switch active cpu-interface | append flash:tac-cpu-$time.txt"
action 1.11 cli command "show platform software fed switch active punt cpuq all | append flash:tac-cpu-$time.txt"
action 2.08 cli command "no monitor capture tac_cpu"
action 2.09 cli command "monitor capture tac_cpu control-plane in match any file location flash:tac-cpu-$time.pcap"
action 2.10 cli command "monitor capture tac_cpu start" pattern "yes"
action 2.11 cli command "yes"
action 2.12 wait 10
action 2.13 cli command "monitor capture tac_cpu stop"
action 3.01 cli command "term default length"
action 3.02 cli command "terminal no exec prompt timestamp"
action 3.03 cli command "no monitor capture tac_cpu" 
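
Actions 0.06 to 0.09 are the least obvious part of the applet: they pull HH:MM:SS out of the "show clock" output and swap the two colons for dots, presumably so the time can be used in a flash file name. Here is a rough Python equivalent, just to show the string handling. The function name and the sample clock string are my own.

```python
import re

def eem_file_stamp(show_clock_output):
    """Mimic actions 0.06-0.09: extract HH:MM:SS and replace ':' with '.'."""
    # \d{1,2} is used so Python matches the full two-digit fields
    m = re.search(r"\d{1,2}:\d{1,2}:\d{1,2}", show_clock_output)
    if m is None:
        raise ValueError("no HH:MM:SS found in show clock output")
    stamp = m.group()
    stamp = stamp[:2] + "." + stamp[3:]   # action 0.07: replace index 2
    stamp = stamp[:5] + "." + stamp[6:]   # action 0.08: replace index 5
    return stamp
```

With a clock line like "*12:34:56.789 UTC Fri May 24 2013" the result is 12.34.56, so the output file would be flash:tac-cpu-12.34.56.txt. The fixed positions 2 and 5 only line up when the hours and minutes are two digits.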

Thursday 23 May 2013

wrong serial number reported by show version on cisco asa

On one of my Cisco ASAs I used "sh version" to get the serial number. When I contacted Cisco for support they said the serial number was no good, even though I had just purchased the support. It turns out there was a bug on the ASA 5515-X where "sh version" reads the serial from the motherboard, not the chassis.

The fix is to use "show inventory". The correct serial was reported there and I could get my support.

Tuesday 21 May 2013

testing beyond your Cisco ASA

I needed to change some NATs and DNS entries. The DNS end of things was going to be handled by a 3rd party. The network looked like this:

120.180.240.224 /28

Address            Use
120.180.240.224    Network ID (unusable)
120.180.240.225    HSRP IP Address
120.180.240.226
120.180.240.227    firewall primary outside interface
120.180.240.228    firewall standby outside interface
120.180.240.229    NAT to internal device
120.180.240.230    NAT to a BI Server
120.180.240.231    NAT to a test app server
120.180.240.232    NAT to production app server (www.cust.com)
120.180.240.233    NAT to standby app server
120.180.240.234
120.180.240.235
120.180.240.236
120.180.240.237    Router 2 IP Address
120.180.240.238    Router 1 IP Address
120.180.240.239    Broadcast address (unusable)
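
The boundaries of the /28 can be double-checked with Python's standard ipaddress module. This is just a quick sanity-check sketch; the function name is mine.

```python
import ipaddress

def subnet_summary(cidr):
    """Return the network ID, broadcast and usable host range for a subnet."""
    net = ipaddress.ip_network(cidr)
    hosts = list(net.hosts())          # excludes network ID and broadcast
    return {
        "network": str(net.network_address),
        "broadcast": str(net.broadcast_address),
        "first_host": str(hosts[0]),
        "last_host": str(hosts[-1]),
        "usable": len(hosts),
    }
```

For 120.180.240.224/28 this gives .224 as the network ID, .239 as the broadcast and 14 usable addresses from .225 to .238, which agrees with the list above.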

I copied the NAT entries and replaced them with the new public IPs. I copied the existing access-list, replaced the old IPs with the new IPs and applied the new ACL to the outside interface. I did a clear xlate. I assumed that everything was correct but it was not working. I couldn't browse to www.cust.com.


Here is a list of steps that were used to resolve the issue:
Confirm the service is up by testing locally on the app servers with localhost or the private IP.
Confirm the public and private IPs are correct.
Look at the NAT entries again "sh run | i static".
Look at the current translations and arp entries "sh xlate" and "sh arp"
Run a packet-tracer command, make sure your NAT and ACL are being hit as expected.
Check your ACL. When I checked there were no hit counts on the ACL, not one. However the packet-tracer said everything should work.
Ensure an ACL allows ICMP to all public IPs. Test it with packet-tracer.
Create a packet capture to capture all incoming ICMP traffic.
Create a script to ping all of the public IPs. You should see the traffic coming in on all of them.
For me traffic was only appearing on the firewall's own interface IPs.
The traffic for the other IPs wasn't making it as far as the firewall.
I assumed wrongly that something else (another firewall) was blocking it.
I contacted the 3rd party. They tried to ping the new public IPs from one of the routers; there was no response and no ARP entry.
The issue was on my firewall. For some reason it was not responding to the ARP requests.
The 3rd party was kind enough to put in static routes to the new IPs and everything started working, except for the BI server, which was running on a non-standard port. This port did have to be unblocked on a 3rd party firewall.
Next attempt is to reboot the Cisco ASA.
If that fails upgrade.
If that fails leave the temp fix in place. Call Cisco support.
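
For the "script to ping all of the public IPs" step, something like the following would do. It is a minimal sketch that assumes a Linux host outside the firewall with the iputils ping ("-c 1" for one packet, "-W 1" for a one second timeout). The function names are my own.

```python
import subprocess

def ping_command(ip):
    """Build a single-packet ping command (Linux iputils syntax assumed)."""
    return ["ping", "-c", "1", "-W", "1", ip]

def sweep(ips, run=subprocess.run):
    """Ping each IP once and return {ip: True if it replied}."""
    results = {}
    for ip in ips:
        completed = run(
            ping_command(ip),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[ip] = completed.returncode == 0
    return results
```

Run it against the NAT addresses while the capture is running on the firewall. Even if nothing replies, the capture shows which destination IPs are actually reaching the outside interface.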







Wednesday 15 May 2013

getting information about stacked cisco switches

show switch ?

installing a new check into opsview

First check the opsview interface to see if the check is already installed.

If not, check http://exchange.nagios.org/ or search Google.

Find a well-rated script that looks like it does the job. Read the details and make sure there are no bugs affecting your software version or setup. If you can't find a script to do the job you will have to write one from scratch or use the default check_snmp script.

Open the script and get the OIDs it uses. Manually check them with snmpwalk. Let's say my OID is "1.3.6.1.4.1.9.9.500.1.2.1.1.6"

snmpwalk -c public -v2c 192.168.0.1 1.3.6.1.4.1.9.9.500.1.2.1.1.6

Read the script to see what the returned value means. If the script hasn't documented it, you will have to get the vendor's documentation or contact their support.

After testing with snmpwalk you can copy the script to the slave. You may have to "su - nagios", chown the script to nagios, chmod the script to 755 and edit the #!/usr/bin/perl line at the top of the script to the relevant path on your system.

Test the script by running it manually
./check_snmp_custom_check.pl -H 192.168.0.1 -C public

If you are happy with the results you need to import the script into the master.

Copy the script to /usr/local/nagios/libexec
su / chown / chmod / edit #!
Go into the opsview web interface
Configuration -> Service checks
Click the Actions button -> create new service check

Fill in
Name
Description
Service group
Check period 24x7
You should be able to select the plugin check_snmp_custom_check.pl (if it's not there try a reload)
Fill in the arguments "-H $HOSTADDRESS$ -C $SNMP_COMMUNITY$" view another check for help

Once complete reload opsview. Now try to add the check to a host (you may need another reload for it to appear).

Now the check should exist to be assigned to other hosts in the future.



update/upgrade the Cisco ASA image

Backup your config and current images (ASA software, ASDM, anyconnect, orig.json)
Also check for certs and licenses

Check which IPS modules are running on the active ASA
show modules
You may need to shutdown/uninstall the old unused IPS 
ciscoasa# sw-module module ips shutdown
ciscoasa# sw-module module ips uninstall
ciscoasa# reload
ciscoasa# show modules


Download the new images from the Cisco website
Grab the latest asdm/anyconnect while you are there

Check ASA and ASDM compatibility
https://www.cisco.com/c/en/us/td/docs/security/asa/compatibility/asamatrx.html#reference_upj_nkl_x4b


Connect to the device with the console cable and putty logging enabled

Copy images to the device
TFTP the images to the Cisco device, you can use a laptop and the tftpd server

Alternatively, if you have SSH access to the ASA, you can copy the image with SCP.
On the ASA run "ssh scopy enable"
From your PC run pscp.exe asa931-smp-k8.bin username@100.100.100.100:asa931-smp-k8.bin

It's also possible to put the images on a FAT (MS-DOS) formatted USB drive, plug it into the ASA and copy from there.

Verify the IOS image
On the ASA run verify disk0:/asa825-k8.bin
Take a screen shot of output
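
It is also worth checking the file on your PC before uploading it. The sketch below computes a checksum of the downloaded image so it can be compared with the value published on the Cisco download page. It assumes you have copied that published value by hand; the function names are mine.

```python
import hashlib

def digest_chunks(chunks, algorithm="sha512"):
    """Hash an iterable of byte chunks and return the hex digest."""
    h = hashlib.new(algorithm)
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

def file_digest(path, algorithm="sha512", chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large images are not loaded into RAM."""
    with open(path, "rb") as f:
        return digest_chunks(iter(lambda: f.read(chunk_size), b""), algorithm)

def matches_published(path, published_hex, algorithm="sha512"):
    """Compare the file's digest with a published checksum string."""
    return file_digest(path, algorithm) == published_hex.strip().lower()
```

For example, file_digest("asa931-smp-k8.bin", "md5") gives a value to compare against the published MD5 for that image.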

Check and update boot settings

sh boot

BOOT variable = disk0:/asa861-2-smp-k8.bin
Current BOOT variable = disk0:/asa861-2-smp-k8.bin
CONFIG_FILE variable =
Current CONFIG_FILE variable =

You can see what image is set to boot. You can copy this image off for safe keeping if you want

The following command sets the new ASA image as the image to boot from
boot system flash:/filename-of-new-ASA-image.bin
eg
boot system disk0:/asa914-smp-k8.bin
Remove old boot value
no boot system disk0:/asa910-smp-k8.bin

The following command sets the new ASDM image to use
asdm image flash:/filename-of-new-ASDM-image.bin
eg
asdm image disk0:/asdm-731-101.bin

Save the config
wr mem

Reload the device to apply the new image
reload

You can watch the console for any error/warning messages. The ASA should boot up with the new image without issue. If there are issues you can roll back to the old image and call cisco support with your log files.

Don't forget to upload images to the secondary device and reload that too



The first thing we need to confirm is the model number and the software version currently running on the ASA. If it's 8.2 or lower, that complicates things. A RAM upgrade might be required depending on the model. If it's 8.2 we will most likely have to rewrite the config by hand, but it depends on what is configured. This is because Cisco made major changes to how NAT works on the ASA from 8.2 -> 8.3. See here. If the ASA is running 8.3 or later we should be good to go. Below is an outline of steps.
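
If you look after a lot of ASAs, the 8.2 versus 8.3 question can be answered from saved "show version" output. A minimal sketch, assuming the usual "Software Version x.y(z)" line; the function name is my own.

```python
import re

def needs_nat_rewrite(show_version):
    """True if the running ASA version is older than 8.3 (pre-8.3 NAT syntax)."""
    m = re.search(r"Software Version (\d+)\.(\d+)", show_version)
    if m is None:
        raise ValueError("could not find the software version line")
    major, minor = int(m.group(1)), int(m.group(2))
    return (major, minor) < (8, 3)
```

The "Device Manager Version" line is ignored because the pattern looks specifically for "Software Version".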

Preparation
Ensure we have support with Cisco and access to an account that can log a call if required. Record the serial number from the device.
Ask all users to log out of anyconnect before the maintenance window
Identify a system test plan. How do you use your ASA? Internet access, VPNs with third parties, remote access VPNs (anyconnect), web server NATs to public IPs, etc.
Connect to the ASA via the console cable with putty logging enabled. Ensure console logging is enabled on the ASA.
Save the running config. Backup the running config, IOS/ASDM and anyconnect images.
Failover the ASA to ensure the secondary ASA is working as is and running the same software version as primary.
Download the latest IOS/ASDM and anyconnect images from Cisco and upload to both ASA’s primary and secondary.
Verify the IOS image.
Just before starting the upgrade, take a basic base line:
  • sh conn count
  • sh xlate count
  • sh crypto isakmp sa
  • sh ver
for use after the upgrade is complete. 
If you have a monitoring solution, check for any existing alerts. Take a screenshot of your dashboard for comparison after upgrade is complete.


Steps

Save the config.
Reload the ASA to apply the new image (your ASA will failover)
Watch the console output as it boots up and make a note of any errors.
Once the image is applied, make sure the updated ASA is active, failover to it.
Ensure everything is working as expected on the new software version. You will need to run your systems test plan here, internet access, VPNs, anyconnect etc.
Once confirmed apply the new image on the other ASA and reload that.
Failover and repeat tests to ensure both ASA are functioning as expected on the new software version.
Ensure you have saved the config “wr”.

Post upgrade tasks
Compare your base line, you will want to see active connections/xlates happening. The numbers won’t be as high outside business hours. We will want to see VPNs up. You might need to generate some traffic on the LAN to get VPNs to come up.
Check your monitoring and ensure everything is working as expected.
Check show ver for licenses applied
Check the anyconnect interface for cert applied
Optionally delete old image files from the ASA. There is usually plenty of space on  them but for some older models it can be an issue.
Normally the new software will be consuming more RAM, your monitoring solution might report this. It can be ignored. The upgrade may also change some SNMP values so you might find some checks in your monitoring solution are no longer working and will need to be resolved.
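
The baseline comparison can be done by eye, but here is a rough sketch for doing it from saved output. It assumes the "N in use, M most used" format that sh conn count and sh xlate count print. The function names and the 50% cut-off are placeholders of mine, not a recommendation.

```python
import re

def in_use(count_output):
    """Extract the 'in use' number from sh conn count / sh xlate count output."""
    m = re.search(r"(\d+) in use", count_output)
    if m is None:
        raise ValueError("no 'in use' figure found")
    return int(m.group(1))

def looks_healthy(before, after, min_ratio=0.5):
    """True if the post-upgrade count is at least min_ratio of the baseline."""
    return in_use(after) >= in_use(before) * min_ratio
```

Outside business hours the counts will be lower anyway, so a failed comparison is a prompt to investigate and not proof of a problem.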

Sample CLI
*** Before changes take screenshots
sh conn count
sh xlate count
sh crypto isakmp sa


*** Confirm secondary is standby ready
sh failover state

*** Fail over onto secondary
no failover active

*** Set the boot variable
sh boot
boot system disk0:/asa9-12-3-12-smp-k8.bin

*** Set the ASDM variable
sh run | i asdm image
asdm image disk0:/asdm-7122.bin

*** If upgrading anyconnect
Log off all anyconnect users (may need to disable anyconnect too)
vpn-sessiondb logoff anyconnect
webvpn
no enable OUTSIDE
no anyconnect enable
vpn-sessiondb logoff anyconnect

no anyconnect image disk0:/anyconnect-win-4.7.01076-webdeploy-k9.pkg 1
no anyconnect image disk0:/anyconnect-linux64-4.7.01076-webdeploy-k9.pkg 2
no anyconnect image disk0:/anyconnect-macos-4.7.01076-webdeploy-k9.pkg 3

sh vpn-sessiondb anyconnect

anyconnect image disk0:/anyconnect-win-4.9.04043-webdeploy-k9.pkg 1
anyconnect image disk0:/anyconnect-linux64-4.9.04043-webdeploy-k9.pkg 2
anyconnect image disk0:/anyconnect-macos-4.9.04043-webdeploy-k9.pkg 3

anyconnect enable

You can delete the old anyconnect file off the ASA if you get strange issues with sessions held open
show webvpn anyconnect

*** Save config and Reload the secondary
wr
reload

*** Wait 10 minutes

*** Check that the secondary has rebooted
sh standby 
Wait for it to show as standby ready

*** Fail back over to secondary (new software)
no failover active

*** basic tests
ping 8.8.8.8
sh dns (if any set up)
sh conn count
sh xlate count
sh crypto isakmp sa
anyconnect portal login and check version

*** Customer tests
If all is good failback to primary and complete same steps to install new software on primary
no failover active (check if boot var is set)
connect to vpn.domain.com etc check for any cert issues

no crypto commands on cisco switch, can't enable ssh

I was trying to enable SSH on a switch but the crypto commands were not present. This was because the image running on the switch didn't have cryptographic services included. Some countries don't allow cryptographic services, and omitting them is also done to make the switch cheaper.

If you do a show version you will see something like the following
System image file is "flash:/c2970-lanbase-mz.122-25.SED/c2970-lanbase-mz.122-25.SED.bin"
We would want to see K9 in the image name, not just lanbase
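
The same check can be scripted across saved "show version" output from many switches. A minimal sketch; the function name is mine, and it simply looks for "k9" in the system image file name.

```python
import re

def has_crypto_image(show_version):
    """True if the system image file name contains k9 (crypto image)."""
    m = re.search(r'System image file is "([^"]+)"', show_version)
    if m is None:
        raise ValueError("no 'System image file' line found")
    return "k9" in m.group(1).lower()
```

For the image shown above this returns False, which matches the missing crypto commands.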

You can download and apply the K9 image from Cisco provided you have the correct license (and even if you haven't purchased the license, but that's not legal)

As a temporary workaround, I connected the console cable to a server that I have access to and connected that way.