====== Evergreen Availability Monitoring ======

This page contains information to help you set up an availability monitoring system for your Evergreen installation.

===== Presentations & Examples =====

  * [[http://evergreen-ils.org/wp-content/uploads/2014/04/eg14_Evergreen_Availability_Monitoring-2014.pdf | Evergreen Availability Monitoring, Evergreen 2014 Conference, Michael Tate, Equinox Software]] - Focuses on [[https://www.nagios.org/|Nagios]], but the general principles apply to all monitoring systems.
  * [[http://git.evergreen-ils.org/?p=contrib/equinox.git;a=tree;f=monitoring;hb=HEAD|Equinox Evergreen Monitoring Scripts]]

===== What to Monitor =====

Ideas on what to monitor so that you know how your Evergreen system is performing. The list is roughly ordered from the top of the software stack to the bottom.

==== Load Balancer ====

Monitor the status of your load balancer. Is it currently running? Are all the backends considered active? How many redirects per minute is it handling?

==== Apache Server Processes ====

  - Is the apache2 process running?
  - How many apache2 processes are running compared with the maximum allowed?
  - How much memory are the apache2 processes using?
  - Are there any runaway apache2 processes using too much CPU?
  - Are the HTTP/HTTPS/WebSocket ports open and accessible?
  - How many requests per second? How many bytes sent and received?
  - How many errors of each type were found in the logs?

==== Apache Performance ====

Monitoring the response time of various web page loads gives you a good picture of how your system is performing for users. Some areas to monitor:

  * Main OPAC page
  * Record detail page
  * Search results using various options
  * Patron login / account summary
  * unAPI results
  * OpenSRF HTTP gateway requests

==== SIP Server ====

  - Is the SIP server process running?
  - How many processes are running compared with the maximum number allowed?
  - Is the SIP port (TCP 6001) open and responding to requests?
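
As a starting point for the "is the port open" questions above, here is a minimal Python sketch of a TCP reachability check. The function name ''port_is_open'' and the host name used in the usage note are illustrative assumptions, not part of Evergreen or of the Equinox scripts. A successful connection only shows that something is listening on the port. It does not prove that the SIP server answers requests, which would require sending a real SIP2 message and checking the reply.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This is a reachability check only; it sends no protocol data.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, ''port_is_open("sip.example.org", 6001)'' (hypothetical host) could be wrapped in a monitoring plugin that raises an alert when it returns ''False''. The same function can be pointed at the HTTP and HTTPS ports listed under Apache Server Processes.
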
==== NCIP Server ====

==== Z39.50 Server ====

  - Is the Z39.50 server process running?
  - How many processes are running compared with the maximum number allowed?
  - Is the Z39.50 port (TCP 210) open and responding to requests?

==== OpenSRF Services ====

  - Are the OpenSRF router processes running?
  - Is there one listener process and at least one drone process for each enabled service?
  - Is the number of drone processes close to the max_children setting for that service?

==== Reporter ====

==== Action Trigger ====

  - Are there any stale action_trigger_runner.pl lock files?
  - How many pending events are there compared with your library's normal level of pending events?
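
The stale lock file question above can be approximated by comparing a lock file's modification time with an age threshold. The following Python sketch assumes you know where your action_trigger_runner.pl cron jobs write their lock files and how long a normal run takes. The function name ''lock_is_stale'', the path, and the threshold are all hypothetical and need to be adapted to your installation.

```python
import os
import time
from typing import Optional


def lock_is_stale(path: str, max_age_seconds: float,
                  now: Optional[float] = None) -> bool:
    """Return True if path exists and was last modified more than
    max_age_seconds ago. A missing lock file is not considered stale.
    """
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        # No lock file means no runner is holding the lock.
        return False
    current = time.time() if now is None else now
    return (current - mtime) > max_age_seconds
```

For example, ''lock_is_stale("/path/to/action-trigger-lock", 2 * 3600)'' would flag a lock file that has not been touched for more than two hours. Pick a threshold comfortably above the longest run you normally see, so that a slow but healthy run does not trigger an alert.
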