A conceptual overview of high availability for a Cortex XSOAR multi-tenant deployment.
With high availability, your system stays up and running even if one component fails. High availability is an active-active failover configuration: if a host server goes down, your data remains available without requiring a restore process.
A high availability multi-tenant environment consists of multiple identical main app servers and multiple high availability groups. Each high availability group contains identical host servers, which run the accounts. Each account belongs to a specific high availability group, and not to an individual host. When you add a host to a high availability group, the host runs all accounts in the group. All hosts in the same high availability group use the same database indexes, providing instant failover if a host becomes unavailable.
The Cortex XSOAR app servers and database are separate. The app servers process requests such as running playbooks and creating custom content, while the Elasticsearch database stores all of the data, including content (custom and out-of-the-box), indicators, and incidents. To achieve full high availability, you install multiple host servers behind a load balancer. These servers use a shared file system for the required files and communicate with the designated index in Elasticsearch.
High availability groups provide your system with:
Redundancy for every account in the group.
Redundancy for every host machine in the group.
Performance improvements for every account in the group.
While full high availability architecture for a multi-tenant deployment requires the use of high availability groups, it’s also possible to have hosts that are not part of a high availability group.
If you are migrating an existing BoltDB deployment to high availability, see Migrate Cortex XSOAR Objects to Elasticsearch for Multi-Tenant.
Design and Scaling
All hosts in a high availability group have the same hardware specifications. Each host in the group must have sufficient resources to run all of the group's accounts, which effectively limits how many accounts a single high availability group can contain. Once you reach that limit (which depends on how many resources the accounts consume), you scale by adding more high availability groups.
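The sizing rule above can be sketched as a small calculation. This is an illustrative estimate only, not a Cortex XSOAR tool: the Resources type, the function names, and the per-account figures are assumptions, and real limits depend on measured account usage.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource figures for a host or for one account."""
    cpu_cores: float
    memory_gb: float


def max_accounts_per_group(host: Resources, per_account: Resources) -> int:
    """Every host runs all accounts in its group, so one host sets the limit."""
    by_cpu = host.cpu_cores // per_account.cpu_cores
    by_memory = host.memory_gb // per_account.memory_gb
    # The scarcer resource determines how many accounts fit.
    return int(min(by_cpu, by_memory))


def groups_needed(total_accounts: int, host: Resources, per_account: Resources) -> int:
    """Number of high availability groups needed to hold all accounts."""
    limit = max_accounts_per_group(host, per_account)
    if limit < 1:
        raise ValueError("a single host cannot run even one account")
    return math.ceil(total_accounts / limit)
```

Note that in this model, adding hosts to a group adds redundancy but does not raise the account limit, because each host still runs every account in the group. Only adding groups increases capacity.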
The main app server(s) manage load balancing between hosts in the same high availability group. Load balancing within a high availability group uses an internal mechanism that distributes requests between all available hosts in the group. As long as an account is alive and reachable on any host in the group, requests are directed to it, offering full availability for that account.
Load balancing at the main app server level is handled by the external load balancer.
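To make the in-group behavior concrete, here is a minimal sketch of how requests for an account could be routed to any host where that account is alive. The internal mechanism is not documented here, so the round-robin policy, the GroupBalancer class, and its method names are all assumptions made for illustration.

```python
from itertools import count


class GroupBalancer:
    """Hypothetical model of request distribution inside one HA group."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self._counter = count()  # drives a simple round-robin (assumed policy)
        self._alive = {}         # (host, account) -> bool, as reported by hosts

    def report(self, host, account, is_alive):
        """Record a health update that a host sends for one of its accounts."""
        self._alive[(host, account)] = is_alive

    def pick_host(self, account):
        """Return a host on which the account is currently alive."""
        candidates = [h for h in self.hosts if self._alive.get((h, account), False)]
        if not candidates:
            raise LookupError(f"account {account!r} is not alive on any host")
        return candidates[next(self._counter) % len(candidates)]
```

In this model, an account stays reachable as long as at least one host in the group reports it alive, which matches the availability guarantee described above.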
Communication
Each host monitors the health of its accounts and updates the main app server over HTTP. Communication between hosts and main app servers is bidirectional: requests can be sent over port 443 from hosts to main app servers and vice versa. All app servers have network access to port 9200 for communication with Elasticsearch. Hosts communicate with their accounts on the accounts’ proprietary ports within the host machine. These account (tenant) ports are dynamically allocated by the operating system.
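When validating the network paths described above, a plain TCP connection test can help confirm that ports 443 and 9200 are reachable. The following is a minimal sketch using only the Python standard library. The hostnames are placeholders, not real Cortex XSOAR defaults, and the check confirms only that the port accepts connections, not that the service behind it is healthy.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Hypothetical paths to verify; replace the hostnames with your own servers.
REQUIRED_PATHS = [
    ("main-app.example.internal", 443),        # host -> main app server
    ("host-1.example.internal", 443),          # main app server -> host
    ("elasticsearch.example.internal", 9200),  # app servers -> Elasticsearch
]


def check_paths(paths):
    """Map each (host, port) pair to its reachability result."""
    return {(host, port): port_reachable(host, port) for host, port in paths}
```

Because communication between hosts and main app servers is bidirectional, run check_paths(REQUIRED_PATHS) from each side, listing the targets that apply to that machine.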