PoolManager
HealthService uses PoolManager to manage the resource pools it belongs to. PoolManager creates a new pool object for each resource pool. Each such pool also has three sub-parts:
- Process: Sends lease requests
- Observer: Handles lease requests
- Client: Sends check requests and handles check ACKs
A lease is like a ticket that contains availability information and has a certain expiration period. If a lease expires before it is renewed, then the process stops acting in the pool, drops all the instances managed by it for that particular pool, and becomes unavailable.
The length of time (in seconds) that the lease is valid is calculated using the PoolManager registry keys and the following formula:
PoolLeaseRequestPeriodSeconds + 2 * PoolNetworkLatencySeconds
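As a worked example of the formula, the following PowerShell sketch computes the lease validity from two values passed in as parameters. The function name Get-PoolLeaseDurationSeconds and the sample values 120 and 30 are assumptions chosen for illustration; they are not read from the registry and are not presented as the product defaults.

```powershell
function Get-PoolLeaseDurationSeconds {
    param(
        [Parameter(Mandatory)][int]$PoolLeaseRequestPeriodSeconds,
        [Parameter(Mandatory)][int]$PoolNetworkLatencySeconds
    )
    # Lease validity = request period + 2 * network latency
    return $PoolLeaseRequestPeriodSeconds + 2 * $PoolNetworkLatencySeconds
}

# Hypothetical sample values: 120 + 2 * 30 = 180 seconds
Get-PoolLeaseDurationSeconds -PoolLeaseRequestPeriodSeconds 120 -PoolNetworkLatencySeconds 30
```

The latency value is counted twice, so raising it extends the lease by twice the amount added.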
Resource pool availability is determined across the HealthServices in a pool by the PoolManager and its sub-parts. The procedure is as follows:
- The process sends a lease request to all observers in the pool (including its own observer).
- Each observer answers with a lease grant, which renews the lease the process has for all other processes in the pool.
- The client sends a check request to all observers in the pool (including its own observer) to verify that they still have a valid lease.
- Each observer replies with a check ACK telling the requesting client if it still has a valid lease or if it has expired (and is considered unavailable).
In this system, if any of the messages fails to be sent or received, the party involved is considered unavailable. When a process or a client sends a request, the answering observers must build a quorum in order for the answer to be valid. For example, of a total of three observers (including its own observer), at least two of them must reply and have the same information about availability. If a quorum is not met, then the process is unavailable.
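The quorum rule can be sketched as a simple majority check. This is a minimal illustration, not the PoolManager implementation: the function name Test-ProcessAvailable and the representation of replies as Boolean values are assumptions, and the threshold is taken to be "more than half of all observers", which matches the two-out-of-three example above.

```powershell
function Test-ProcessAvailable {
    param(
        [Parameter(Mandatory)][int]$TotalObservers,
        # One entry per observer that replied: $true = lease valid, $false = lease expired.
        # Observers that did not reply at all are simply absent from this list.
        [bool[]]$Replies = @()
    )
    # More than half of all observers must agree that the lease is valid.
    $needed = [math]::Floor($TotalObservers / 2) + 1
    $valid  = @($Replies | Where-Object { $_ }).Count
    return ($valid -ge $needed)
}

# Three observers, two agree the lease is valid: quorum is met.
Test-ProcessAvailable -TotalObservers 3 -Replies $true, $true
# Three observers, only one reply arrives: no quorum, the process is unavailable.
Test-ProcessAvailable -TotalObservers 3 -Replies $true
```

Missing replies count against the quorum in this sketch, which mirrors the rule that a party whose message is not sent or received is considered unavailable.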
The lease grant must be received within a certain time of the lease request or it will also be invalid and the involved process will be set as unavailable. This period (in seconds) is controlled by the PoolNetworkLatencySeconds registry key.
Besides adding to the number of voting observers, there is another reason why the Operational Database also plays the part of observer for each pool. If a pool contains a single management server, then at least two observers are needed for the pool to function and for the quorum to be met: the management server's own observer and the database. In that configuration, losing either one makes no practical difference. If the database is unavailable, everything is down anyway, and if the single management server goes down, the database cannot act as a process and run workflows. The observer availability information for each existing pool is stored in the database in a table called AgentPoolLease. Note that the database observer is known as the default observer.
It is important to know that each process has a process value and a process counter, which are involved in the lease grant and request. When the HealthService becomes available in the pool again after being considered unavailable (for example, after the service was stopped or after being in maintenance mode), the process value is incremented (+1). The process counter is incremented after each successful lease request. These values are used to validate the lease grant and request.
A resource pool manages top level instances, also known as top level managed entities, or TLMEs for short. TLMEs are instances that are not managed by a specific HealthService (either an agent, gateway server, or management server), but instead are managed by a resource pool in order to obtain high availability. This does not include Windows agents; these are managed by a parent management server, and if it becomes unavailable, they can be configured to fail over to another management server. Previously, in Operations Manager 2007 R2, the TLMEs were managed only by the root management server. An outage of the root management server therefore left many TLMEs unmanaged, and many workflows would no longer run. An example of such a TLME is the Datawarehouse Synchronization Server. This instance (of its class) is the target for many important workflows that are related to writing data into the Data Warehouse Database, such as a workflow that synchronizes data (like events) between the Operational Database and the Data Warehouse Database.
In a resource pool, the available TLMEs are split equally, and each HealthService in the pool gets its share. A hashing algorithm based on the GUIDs of the HealthServices in combination with those of the TLMEs ensures that if all the involved pool members (HealthServices) are available, the exact same TLMEs are given to each member. If a member becomes unavailable, then the hashing is redone and the entire list of TLMEs from the pool is split equally among the remaining available members. As soon as the unavailable member becomes available again, the same process follows, and it receives the exact same TLMEs it was managing when it became unavailable.
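The behavior described above, where a member gets the same TLMEs back once it returns, is what a deterministic hash over member and TLME identifiers provides. The actual algorithm used by the product is not given here, so the sketch below swaps in rendezvous (highest-random-weight) hashing over SHA-256 as a stand-in. The function name Get-TlmeOwner and the identifier strings are hypothetical, and unlike the real pool, this technique spreads TLMEs only roughly evenly rather than exactly equally.

```powershell
function Get-TlmeOwner {
    param(
        [Parameter(Mandatory)][string]$TlmeId,
        [Parameter(Mandatory)][string[]]$AvailableMembers
    )
    $sha = [System.Security.Cryptography.SHA256]::Create()
    try {
        $best = $null
        $bestScore = [uint64]0
        foreach ($member in $AvailableMembers) {
            # Score each (member, TLME) pair; the highest score owns the TLME.
            $bytes = [System.Text.Encoding]::UTF8.GetBytes("$member|$TlmeId")
            $score = [System.BitConverter]::ToUInt64($sha.ComputeHash($bytes), 0)
            if ($null -eq $best -or $score -gt $bestScore) {
                $best = $member
                $bestScore = $score
            }
        }
        return $best
    }
    finally {
        $sha.Dispose()
    }
}

# Hypothetical identifiers; in a real pool these would be GUIDs.
$members = 'MS1', 'MS2', 'MS3'
foreach ($tlme in 'tlme-a', 'tlme-b', 'tlme-c', 'tlme-d') {
    $all     = Get-TlmeOwner -TlmeId $tlme -AvailableMembers $members
    $reduced = Get-TlmeOwner -TlmeId $tlme -AvailableMembers 'MS1', 'MS3'
    "{0}: all members -> {1}; MS2 unavailable -> {2}" -f $tlme, $all, $reduced
}
```

Because the score depends only on the member and TLME identifiers, recomputing ownership with the full member list always yields the original assignment, and only the TLMEs owned by the missing member move while it is away.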
It is important to know that each time there is a pool member change and the TLMEs are reorganized among the available members, a configuration update takes place on all of the available members (HealthServices).
To view all TLMEs that exist per resource pool and per current owning pool member (management server), you can run the following Windows PowerShell script on one of the management servers. It uses Out-GridView to output the data so that you can sort and filter it easily:
Import-Module OperationsManager
New-SCOMManagementGroupConnection -ComputerName "localhost"

# Task that lists the top level instances monitored by a pool member
$GetTLMEfromPoolTask = Get-SCOMTask -DisplayName "Get Top Level Instances Monitored By A Pool Member"
$HealthServiceClass = Get-SCOMClass -Name "Microsoft.SystemCenter.HealthService"
$ResourcePools = Get-SCOMResourcePool
$TLMEInstances = @()

foreach ($pool in $ResourcePools) {
    foreach ($ms in $pool.Members) {
        # Find the HealthService instance of this pool member
        $hs = Get-SCOMClassInstance -Class $HealthServiceClass | ? { $_.DisplayName -eq $ms.DisplayName }
        $param = @{ PoolId = $pool.Id.ToString() }
        $out = Start-SCOMTask -Task $GetTLMEfromPoolTask -Instance $hs -Override $param -ErrorAction SilentlyContinue

        # Skip this member if the task could not be started (avoids waiting forever)
        if ($null -eq $out) { continue }

        # Wait for the batch ID, then for the task to leave the "Started" state
        do {
            $batch = $out.BatchId
            Start-Sleep -Seconds 3
        } while ($null -eq $batch)
        do {
            $result = Get-SCOMTaskResult -BatchId $batch -ErrorAction SilentlyContinue
            $status = $result.Status
            Start-Sleep -Seconds 5
        } while ($status -eq "Started")

        # The task output is XML with one ManagedEntity node per TLME
        [xml]$output = $result.Output
        if ($null -ne $output) {
            $TLMEs = $output.SelectNodes("//ManagedEntity")
            foreach ($TLME in $TLMEs) {
                $TLMEInstance = Get-SCOMClassInstance -Id $TLME.GetAttribute("managedEntityId")
                $TLMEClass = Get-SCOMClass -Id $TLMEInstance.MonitoringClassIds
                $hash = @{
                    ResourcePool     = $pool.DisplayName
                    ManagementServer = $ms.DisplayName
                    TLMEFullName     = $TLMEInstance.FullName
                    TLMEDisplayName  = $TLMEInstance.DisplayName
                    TLMEClass        = $TLMEClass
                    TLMEId           = $TLMEInstance.Id
                }
                $obj = New-Object PSObject -Property $hash
                $TLMEInstances += $obj
            }
        }
    }
}

$TLMEInstances | Sort-Object ResourcePool, ManagementServer, TLMEClass | Out-GridView
EventID=15000
Severity=Informational
Message=The pool member has initialized.
Management Group: %1
Management Group ID: %2
Pool Name: %3
Pool ID: %4
Pool Version: %5
Number of Pool Members: %6
Number of Observer Only Pool Members: %7
Number of Members Added: %8
Number of Members Removed: %9
Number of Instances: %10
Number of Instances Added: %11
Number of Instances Removed: %12

EventID=15001
Severity=Informational
Message=More than half of the members of the pool have acknowledged the most recent initialization check request. The pool member will send a lease request to acquire ownership of managed objects assigned to the pool.
Management Group: %1
Management Group ID: %2
Pool Name: %3
Pool ID: %4
Pool Version: %5
Number of Pool Members: %6
Number of Observer Only Pool Members: %7
Number of Instances: %8

EventID=15002
Severity=Error
Message=The pool member cannot send a lease request to acquire ownership of managed objects assigned to the pool because half or fewer members of the pool acknowledged the most recent initialization check request. The pool member will continue to send an initialization check request.
Management Group: %1
Management Group ID: %2
Pool Name: %3
Pool ID: %4
Pool Version: %5
Number of Pool Members: %6
Number of Observer Only Pool Members: %7
Number of Instances: %8

EventID=15003
Severity=Informational
Message=The availability of one or more members of the pool has changed. The ownership for all managed objects assigned to the pool will be redistributed between available pool members.
Management Group: %1
Management Group ID: %2
Pool Name: %3
Pool ID: %4
Pool Version: %5
Local Pool Member Available: %6
Number of Pool Members: %7
Number of Observer Only Pool Members: %8
Number of Members Available: %9
Number of Instances: %10
Number of Instances Locally Activated: %11
Number of Instances Locally Deactivated: %12

EventID=15004
Severity=Error
Message=The pool member no longer owns any managed objects assigned to the pool because half or fewer members of the pool have acknowledged the most recent lease request. The pool member has unloaded the workflows for managed objects it previously owned.
Management Group: %1
Management Group ID: %2
Pool Name: %3
Pool ID: %4
Pool Version: %5
Number of Pool Members: %6
Number of Observer Only Pool Members: %7
Number of Instances: %8