Switches and External Networking
The switches are set up in VLT (virtual link trunking) for the prod 10g and management/provisioning, so one switch of each type can be lost without affecting functionality.
The LACP bonding distributes the links across different switches, so the 2x1g, 4x10g and 2x10g bonds carrying the various VLAN traffic should all be able to withstand multiple failures of host NIC, switch port, or even switch before functionality is affected.
The Nectar route has connections to both the OGG and TDC border routers for redundancy. Failover and load balancing are managed by ITS and should be automatic (the primary link is TDC, barring a failure).
The various components of OpenStack are configured with different levels of HA. Should faults occur, here is a high-level overview of how failover will occur. Links with further details will be provided where applicable.
If a compute host is or will be down, disable its nova-compute service from a nova management node. <link needed>
Instances on that host may need to be manually live or cold migrated to a new host. <link needed>
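The two steps above can be sketched as follows. This is a minimal sketch only, assuming the legacy `nova` command-line client of the Mitaka/Newton era is installed on the management node with admin credentials loaded. The host name `compute-host-01`, the instance ID, and the helper function names are hypothetical. The helpers only build the command lines so that they can be reviewed before anyone runs them.

```python
import shlex


def disable_compute_cmd(host, reason):
    """Build the command that stops new instances being scheduled to `host`."""
    return ["nova", "service-disable", "--reason", reason, host, "nova-compute"]


def migrate_cmd(server, live=True, target_host=None):
    """Build the command that moves one instance off its current host.

    live=True  -> live migration (instance keeps running)
    live=False -> cold migration (scheduler picks the target host)
    """
    if live:
        cmd = ["nova", "live-migration", server]
        if target_host:
            cmd.append(target_host)
        return cmd
    return ["nova", "migrate", server]


if __name__ == "__main__":
    # Hypothetical host and instance names, for illustration only.
    plan = [
        disable_compute_cmd("compute-host-01", "planned maintenance"),
        migrate_cmd("INSTANCE_ID", live=True),
        migrate_cmd("INSTANCE_ID", live=False),
    ]
    for cmd in plan:
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere. Whether live or cold migration is appropriate depends on the state of the failed host and is not decided here.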
|Nova Management||ntr-novam0x||Automatic (except for console)||Failover seems to be automatic in testing. Note that the novnc (console) service currently runs only on ntr-novam01. If this host goes down, there is no quick failover: change the DNS record for console.nectar.auckland.ac.nz to point to ntr-novam02, and change the VNC server in puppet to ntr-novam02.|
|Neutron||ntr-neutronn0x||Manual||The process is a little tricky, as the DHCP range needs to be swapped over to the new nova server. See procedure <link needed>|
|Glance||ntr-glancn0x||Semi-Automatic||Glance API requests increment through all available servers, which in normal production results in a balanced load. If one server fails, the others will continue to process requests, but requests will still be sent to (and dropped by) the Glance server that is down. To fail over properly, set the hiera value in puppet base.yaml that lists the Glance servers so that it includes only operational servers.|
|RabbitMQ||ntr-mq0x||Automatic||RabbitMQ cluster (3 nodes currently) has queue-syncing enabled and is HA to tolerate 2 rabbit node failures.|
|OpenStack DB||ntr-db0x||Automatic||MariaDB is in a Galera cluster (three nodes currently) with one primary and two additional backups that sync off the primary. Two DB nodes can be lost without any effect on production. The DB will automatically re-sync when nodes are restored.|
|Horizon||n/a||Automatic||The University of Melbourne currently hosts the dashboard for us in an HA environment. When we host our own dashboard, we will need to set it up behind HAProxy or similar.|
|Swift||n/a||Automatic||We don't yet host Swift distributed object storage nodes, but will in the future. In either case, the system is HA.|
|Internet NAT Gateway||ntr-gw01, ntr-gw02||Manual||All API, DB, and admin/mgmt traffic exits via the NAT gateway at 10.31.xx.253 (normally ntr-gw02 / gw.nectar.auckland.ac.nz / 22.214.171.124). Since this is a crucial SPOF, a clone exists at 126.96.36.199. If the primary gateway goes down, uncomment the two 10.31.xx.253 NICs in /etc/network/interfaces on the clone and bring them up (ifup) to restore functionality.|
|Ceph||ntr-sto0x, ntr-cad, ntr-mon0x||Automatic||There are currently 8 storage nodes and 5 Ceph monitors. Ceph can absorb the loss of 2 monitor nodes and up to 2 physical nodes (36 OSDs, i.e. physical drives). However, depending on placement group mapping, the maximum functional loss could be as little as 1 physical node (18 OSDs, i.e. physical drives).|
|Cinder||ntr-cindern||None||There is no easy way to set up Cinder with HA in Mitaka or Newton. To partially compensate, a permanent snapshot exists of the VM on its KVM host (ntr-ctr03).|
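The Glance row above says that requests keep rotating through every listed server, including one that is down, until the server list is trimmed. The toy model below illustrates that behaviour. It is a sketch under stated assumptions, not the actual Glance client logic: the three host names (following the ntr-glancn0x pattern), the server count, and the function names `dispatch` and `operational` are all hypothetical.

```python
from itertools import cycle


def dispatch(servers, down, n_requests):
    """Round-robin n_requests over servers; return (served, dropped) counts."""
    if not servers:
        raise ValueError("server list is empty")
    rotation = cycle(servers)
    served = dropped = 0
    for _ in range(n_requests):
        target = next(rotation)
        if target in down:
            dropped += 1  # request went to a dead server
        else:
            served += 1
    return served, dropped


def operational(servers, down):
    """Return the server list with failed servers removed (the manual fix)."""
    return [s for s in servers if s not in down]


if __name__ == "__main__":
    # Hypothetical server list; the real one lives in puppet base.yaml.
    servers = ["ntr-glancn01", "ntr-glancn02", "ntr-glancn03"]
    down = {"ntr-glancn02"}

    print(dispatch(servers, down, 9))                     # (6, 3)
    print(dispatch(operational(servers, down), down, 9))  # (9, 0)
```

In this model one request in three is dropped while the failed server stays in the list, and none are dropped once the list contains only operational servers, which is why the row is rated Semi-Automatic rather than Automatic.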