We recently executed a major upgrade from ITMS 7.5 SP1 to ITMS 8.0. The upgrade was ultimately a success, and both the major and minor issues we experienced were resolved in the 48 hours that followed the go-live.
I'm sharing our experience here to give folks an indication of the planning and testing that many of us undertake when embarking on major upgrades like this. I also thought the major issue we hit was worth disseminating, in the hope that others considering this rollout can avoid it.
If you just want the main upgrade gotcha we suffered, and aren't terribly interested in the planning side, then here it is:
If you are about to embark on an upgrade from 7.5 SP1, consider changing the policy on your AppID service account so that it does not lock out on multiple failed authentication attempts. We saw a small percentage of agents go rogue and authenticate with incorrect credentials against the SMP during the upgrade process. This eventually locked out the service account and brought down the entire infrastructure.
The PowerShell script later in this article can be executed on your Domain Controllers to reveal this bad agent behaviour.
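If you want to check where you stand on this before upgrade day, something like the following is a reasonable starting point. It's only a minimal sketch, assuming the ActiveDirectory PowerShell module (RSAT) is available on a management host; the account name is just a placeholder.

# Placeholder account name - substitute your own AppID service account
$appId = 'svc_altiris_appid'

# What lockout threshold applies domain-wide?
Get-ADDefaultDomainPasswordPolicy | Select-Object LockoutThreshold, LockoutDuration

# Is the AppID account currently locked out, and when was the last bad password?
Get-ADUser -Identity $appId -Properties LockedOut, LastBadPasswordAttempt, badPwdCount |
    Select-Object Name, LockedOut, LastBadPasswordAttempt, badPwdCount

# Unlock it if needed (rogue agents will simply lock it again, so this is not a fix)
Unlock-ADAccount -Identity $appId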
What follows now is just a discussion of our upgrade plan, its execution, and the issues that followed.
Before I do that, it's pertinent to ask why we were upgrading to ITMS 8 at all. What does it offer us over 7.5 SP1? The key reasons for us were as follows:
- Windows 10 client support
- User searching in the console (very nice)
- Agent health data presented through the console icons, and in detail in a flipbook (should directly help our IT techs)
- Improved software reporting
- Platform scalability improvements and more current SQL Server support (SQL Server 2012 SP2 and SQL Server 2008 R2 SP3)
As you can see, our first item was Windows 10 client support. With our Windows 10 rollouts just around the corner, we wanted to make sure we were prepared for that very busy future.
1. Our Environment
Our central ITMS installation is pretty typical - it's a small/medium business setup. We have,
- 1 NS (serving 3000 clients)
- 1 SQL Server
- 1 Site Server (for task offloading)
- 1 Cloud Gateway
When I plan upgrades, I generally split them into 3 phases:
- Preparation
- Upgrade
- Wash-up/Remediation
I'll talk about each of these here so you can get an idea of our overall experience.
Note: In our environment we have two production ITMS installations - one serving 30 clients and another serving 3000 clients. This enables us to execute these 3 phases on the small client base first, and then on the larger client base.
2. Preparation (~1 month)
This is where most of your effort in an upgrade should go if you want to minimise the upgrade time and the wash-up. It's a pain to do, but, seriously, it's worth doing. Here's what we did:
- Build a virgin virtual server and install our target version of ITMS (v8)
- Check basic functionality of the server, and see how stable it is (lots of log checking)
- Be prepared to raise Symantec Support cases
- Check target version release notes - are there known issues relevant to your environment?
The objective here is to confirm that the target version works with your current processes in principle.
If the testing on the virgin box goes well, then build an ITMS server of your current version (in our case, ITMS 7.5SP1). Configure your core policies to mirror your production box, and then upgrade that. Document what changes, notably in your agent upgrade policies and targets.
In the weeks before the upgrade also,
- Spend more time attending to your live SMP.
Make sure that you are happy with the state of the event logs and with the current functioning of the client estate. You want to upgrade the SMP when it's in prime condition.
- Gather Testing Documentation
Gather "Acceptance Testing" documentation from those that use the console for their day-to-day tasks. Agree what items are critical and should trigger a back-out if they don't work in the post-upgrade scenario.
- Build the Upgrade Checklist
On the day of the upgrade, you don't want to be 'winging it' too much, right? Build a checklist, and think about back-out plans for each step that introduces a change. This is laborious, but will serve you well on upgrade day.
- Schedule Upgrade for Production Server
Let people know you are planning the upgrade and that there will be downtime. Let folks know that there is a back-out plan. Let them know what the changes will be. Let them know that it will take a few days for the client upgrade process to roll out (we aimed for 90% coverage in 3 days). If you can, plan the upgrade over a weekend (we thought 2 days would just about do it).
3. Upgrade (~2 days)
On the morning of the upgrade, confirm the sanity of your checklist one last time. Check the release notes for the version you are upgrading to again (they could have been updated). If all looks good, then continue.
3.1 Pre-upgrade Prep
- Firewall off the SMP (this helps put the server in a quiet state ready for snapshotting)
- Check Event Queues, Windows logs, SMP logs (make sure all is well before the upgrade)
- Confirm disk space on SMP and SQL Server (ours is an off-box SQL Server; a scripted sketch of this and the log check follows this list)
- Make sure the infrastructure is quiet and then ensure file backups on all servers are current
- Reboot the SMP (just to make sure we have a clean slate in terms of the OS)
- Snapshot all the Altiris virtual machines. At the very least, we snapshot the SMP and SQL Server
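A couple of those checks lend themselves to scripting. Here's a rough sketch of the disk-space and event-log checks; the server names are placeholders, and it assumes CIM/WMI and remote event-log access are open between your admin host and the servers.

# Server names are placeholders - substitute your own SMP and SQL Server
$servers = 'SMP01', 'SQL01'

# Free space per fixed disk on each server
Get-CimInstance -ClassName Win32_LogicalDisk -Filter "DriveType=3" -ComputerName $servers |
    Select-Object PSComputerName, DeviceID,
        @{ n = 'FreeGB'; e = { [math]::Round($_.FreeSpace / 1GB, 1) } },
        @{ n = 'SizeGB'; e = { [math]::Round($_.Size / 1GB, 1) } }

# Any errors in the SMP's Application log over the last 24 hours?
Get-WinEvent -ComputerName 'SMP01' -FilterHashtable @{
    LogName   = 'Application'
    Level     = 2                        # 2 = Error
    StartTime = (Get-Date).AddDays(-1)
} | Select-Object TimeCreated, ProviderName, Id, Message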
3.2 Execute Upgrade
- Install SMP8 Upgrade prerequisites on SMP (JRE 8 in our case)
- Begin preparing the download of SMP8 in the Symantec Installation Manager (SIM) (~20 mins)
- Install SMP8 (~90 mins to install, ~90 mins to configure)
- Reboot SMP, Check logs
- Install HF5 through SIM (~90 mins)
- Apply "Agent Health Reporting" Power Management Fix (TECH234452)
- Apply custom fix (similar to above) for PcAnywhere
- Check and configure Agent Upgrade Policies (we noticed that many of these will be turned off and/or reset. Go through each plugin and confirm that the policies and targets are as you'd expect)
- Check and confirm Cloud Agent settings are correct (we noticed https redirection was reset to http, so fixed that)
- Check logs
- Execute relevant portions of "Acceptance Testing" plans logging into the Console with the appropriate group rights
- We found at this point that "IT Analytics" was broken. As this had already been flagged as an acceptable temporary casualty of the upgrade, we just logged this and moved on.
3.3 Site Server Upgrade
- Install Site Server Upgrade pre-requisites (.NET Framework 4.5.1)
- Enable Site Server agent upgrade policy and check that the site server exists in the policy target
- Enable Firewall rule to enable site server access to SMP
- Confirm Site Server upgrade
3.4 Single Client Testing
- Enable Firewall rule to enable single-test client access (see the firewall sketch after this list)
- Observe client upgrade process
- Confirm basic plugin functionality
- Execute relevant portions of "Acceptance Testing" plans logging into the Console with the appropriate group rights
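For what it's worth, the staged firewall access is easy enough to script on the SMP. This is only a rough sketch using the built-in NetSecurity cmdlets (available on Server 2012 and later); the rule name, ports and IP addresses are all placeholders, and it assumes inbound 80/443 is otherwise blocked for the duration of the upgrade.

# Allow a single test client through by IP (placeholder address)
New-NetFirewallRule -DisplayName 'SMP upgrade - staged agent access' -Direction Inbound `
    -Protocol TCP -LocalPort 80, 443 -RemoteAddress 10.0.0.50 -Action Allow

# Widen the scope as you move from one client to the pilot group
Set-NetFirewallRule -DisplayName 'SMP upgrade - staged agent access' `
    -RemoteAddress 10.0.0.50, 10.0.0.51, 10.0.0.60

# At go-live, remove the staging rule and restore normal access
Remove-NetFirewallRule -DisplayName 'SMP upgrade - staged agent access'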
3.5 Upgrade Cloud Gateway
- Enable Firewall on CEM gateway to block all clients
- Install Server Prerequisites (.NET 4.5.1 already installed)
- Upgrade CEM Internet Gateway Package
- Enable Firewall rule to enable single test-client access (whatismyip.com is your friend)
- Observe client upgrade process
- Confirm basic plugin functionality
- Confirm cloud agent switching to and from Cloud mode (using VPN client)
- Execute relevant portions of "Acceptance Testing" plans logging into the Console with the appropriate group rights
3.6 Multiple Client Testing
- Enable Firewall rule to enable multiple-test client access
- Observe client upgrade process
- Confirm basic plugin functionality
- Check logs
- Confirm again relevant portions of "Acceptance Testing" plans logging into the Console with the appropriate group rights
3.7 Go-Live
- Commit Virtual Machine snapshots
- Enable Firewall Rule for All Clients on SMP
- Enable Firewall Rule for All Clients on Cloud Gateway
- Monitor logs
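For the "monitor logs" step we mostly sat on the console's log viewer, but a quick way to spot fresh errors from a shell is to grep the SMP log directory. A sketch only, assuming the default log location and that severity 1 entries are errors; adjust the path if your installation differs.

# Default SMP log location - adjust if your installation differs
$logDir = 'C:\ProgramData\Symantec\SMP\Logs'

# Pull out error-severity entries from the current log files
Get-ChildItem -Path $logDir -Filter *.log |
    Select-String -Pattern "severity='1'" |
    Select-Object Filename, LineNumber, Line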
And, for us, that was it. I went to bed around midnight Sunday 29th with 300 client machines upgrading nicely. All seemed good.
4. Wash-Up/Remediation
I began my server checks at 7:30am Monday morning. My expectation was to find a few small, niggly issues, and I had earmarked the following three days to track them down and remediate them. However, when I logged into my work machine, I found the SMP was down. Totally down.
4.1 Agent Upgrade DDoS (Major Issue)
So, after allowing myself a good 10 seconds to panic, it was time to track down what caused this massive systems failure. After all, the system was health checked just a few hours ago as being pretty darn stable.
A quick diagnostic revealed that the App ID service account was locked out. Unlocking the account had no effect, as it was locked out again moments later. So what was locking it? Looking at our domain controller logs, we could see that clients were failing to authenticate with the AppID credential, and this was locking out the account. It was happening far too fast for manually re-enabling the account to have any effect.
Looking at the clients themselves, it turned out that some had gone rogue during the upgrade process. They were sending corrupted authentication requests to the SMP. These requests were failing their authentication and resulting in a lockout of the service account. After quick emergency discussions, we decided that the simplest action at this point was to temporarily disable account lockouts.
We then cobbled together a PowerShell script to identify clients which were failing their auth attempts on the App ID, and raised a case with Symantec Support. Here is the PowerShell we came up with to reveal the machine names which were hitting the domain controller with failed authentications:
Get-WinEvent -LogName Security |
    Where-Object { $_.ProviderName -eq 'Microsoft-Windows-Security-Auditing' -and
                   $_.Level -eq 0 -and $_.Id -eq 4776 -and
                   $_.Message -match 'CHANGE_ME_TO_YOUR_APP_ID_ACCOUNT' } |
    Select-Object TimeCreated, Message |
    Select-String "0xC000006A" |
    ForEach-Object { $_ -replace "`r`n", "" } |
    Select-String -Pattern 'TimeCreated=(\S+\s\S+);.*Logon Account:\s+(\S+)Source Workstation:\s+(\S+)+Error Code:' |
    ForEach-Object { " $($_.Matches.Groups[1]) $($_.Matches.Groups[2]) $($_.Matches.Groups[3])" } |
    Out-File .\Logging.txt
In the above script, you'll need to change the string "CHANGE_ME_TO_YOUR_APP_ID_ACCOUNT" to your App ID.
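Since the failed authentications can land on any domain controller, we ran the check against each DC. A simplified variant of the same idea, run remotely over WinRM (the DC names are placeholders, and -MaxEvents is just there to keep the query quick):

# DC names are placeholders - list all of your domain controllers here
$dcs = 'DC01', 'DC02'

Invoke-Command -ComputerName $dcs -ScriptBlock {
    Get-WinEvent -LogName Security -MaxEvents 5000 |
        Where-Object { $_.Id -eq 4776 -and
                       $_.Message -match 'CHANGE_ME_TO_YOUR_APP_ID_ACCOUNT' -and
                       $_.Message -match '0xC000006A' } |
        Select-Object MachineName, TimeCreated, Message
} | Out-File .\Logging-AllDCs.txt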
Temporarily changing the account lockout policy allowed the agent upgrades to struggle through and complete, but it cost us a day in our plug-in rollout timeline.
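For reference, if your lockout settings come straight from the default domain password policy (rather than Group Policy enforcement or a fine-grained password policy, in which case make the change there instead), the temporary relaxation is a one-liner. The domain name below is a placeholder.

# Emergency measure only: a threshold of 0 disables account lockouts domain-wide.
# Revert it as soon as the agent upgrades have worked through.
Set-ADDefaultDomainPasswordPolicy -Identity yourdomain.local -LockoutThreshold 0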
If I had to guess at what was happening here, my money would be on a bug in the old agent upgrade process. The 7.5 SP1 agent is quite old, and it was only during the upgrade that this App ID account lockout issue occurred. Once the agent upgrade had completed, the agents behaved fine. So, it's likely not an issue with the new agent, just a bug in the very old one we had, which only manifested when we triggered that code path in the upgrade process.
4.2 Real-Time Manager (Minor Issue)
This was a minor issue, but we found that Real-Time Manager didn't seem to work consistently when the console was accessed over HTTP; changing the URL to HTTPS resolved this.
4.3 PCAnywhere (Minor Issue)
We had a glitch with PCAnywhere in that the policy targets seemed to undergo a reset in the upgrade. As a result, we were targeting the wrong machines with the latest package upgrade. This had the effect of pushing the host-only package to machines which had the full package installed, which removed the PCAnywhere QuickConnect client.
Resolved by fixing the targets and pushing the full package again.
Our testing plan had omitted the step to check the targets for two PCAnywhere packages, so lesson learned here to be more thorough.
I should point out here that a known, documented repercussion of the upgrade is that you cannot configure PCAnywhere anymore in the console. The console object to perform this is corrupted by the installation. As PCAnywhere is EOL, Symantec will not be fixing this.
4.4 Agent Health Reporting (Minor Issue)
Whilst agent health reporting is a really good feature in ITMS 8, there is one aspect of it which is niggly. Once it's there, people want to see ticks - not question marks or crosses. So, when we did the upgrade, we applied two little T-SQL patches to make the Power Plan plugin and the PCAnywhere plugin report as healthy. The patch for the Power Plan plugin health reporting we found in TECH234452, and, to fix PCAnywhere, we made an equivalent patch:
USE [Symantec_CMDB]
GO

INSERT INTO [dbo].[SmpVersions]
    ([ProductGuid]
    ,[PluginGuid]
    ,[Type]
    ,[Version]
    ,[Major]
    ,[Minor]
    ,[Build]
    ,[Revision])
VALUES
    ('C432B710-F971-11A2-8643-20105BF409AF'   -- Guid from vProduct
    ,'452F2BCF-7261-4AA6-9228-387676F3A183'   -- Agent Class Guid from Inv_AeX_AC_Client_Agent
    ,0
    ,'12.6.8556.0'
    ,12
    ,6
    ,8556
    ,-1)
GO
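To confirm the row has landed (and to see what the SMP already holds for a plug-in), a quick query from PowerShell works. This is just a sketch assuming the SqlServer module's Invoke-Sqlcmd is available; the instance name is a placeholder.

# Server instance is a placeholder - point this at your Symantec_CMDB instance
Invoke-Sqlcmd -ServerInstance 'SQL01' -Database 'Symantec_CMDB' -Query @"
SELECT ProductGuid, PluginGuid, Version
FROM dbo.SmpVersions
WHERE PluginGuid = '452F2BCF-7261-4AA6-9228-387676F3A183'
"@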
One fly in the ointment however was the recently EOL'd Software Virtualization (Workspace Virtualization and Streaming). We found that when clients were reporting back with higher versions than the SMP was configured to roll out, they were "Unhealthy" as the client and server versions didn't match.
We can't see a way around this without ramping up the SWS client version on the server. It would be good to have the option in the future of simply not reporting the status of certain plug-ins in "Agent Health".
But it's not a big issue, so I just threw it on my "contemplation" pile.
5. Summary
On reflection, I view the upgrade rather positively. We didn't have to roll back, and I felt the preparation was worth it in making the upgrade and testing process move along swiftly. It also minimised the surprises on the day, as we'd already worked through a few trial upgrades in advance of the main rollout.
The major issue, the App ID lockout, impacted us for 90 minutes during business hours (although it felt a lot longer from my point of view). This downtime was small simply because we were in the fortunate position of being able to quickly push through the required account lockout changes. Others might not be in such a fortunate situation, and this is the main motivation for writing this article.
All in all, it took 72 hours for 85% of the agent upgrades to complete across our estate. We anticipated issues with inventory, software delivery and remote control in this time window. We'd factored 3 days of infrastructure hand-holding following the upgrade into our planning, and the minor issues encountered were for the most part taken care of in this window too.
6. Useful Links
- ITMS 8.0 Documentation, DOC8649
- ITMS 8.0HF5 Release Notes, DOC9490
- HowTo Upgrade to ITMS 7.6/8.0 and still use PcAnywhere, DOC124981
- PCAnywhere EOL, Workspace Virtualisation EOL
- Fixing the Power Management Scheme plug-in health reporting in ITMS8.0
- After upgrading Client Agents from ITMS 7.5x to 8.0, the AppID keeps getting locked out, TECH237012
- IT Analytics broken after ITMS upgrade, TECH213502