This post is part travelogue, part breadcrumbs, part manual on how to troubleshoot a problem. It is posted here as a signpost to others, to discuss how I approached a particular problem with a 2013 Mac Pro cylinder that refused to power down.
The problem began innocuously enough. Sometime after 10.13.3, my Dad’s 2013 Mac Pro, which powers his digital photography workflows, refused to shut down. The behavior was odd: if you chose Shut Down from the Apple menu (or, as we began the process, Restart), it would quit all open applications, log out the active user session, and then sit at a black screen, doing close to nothing. External storage volumes would occasionally report disk activity with blinking indicator lights, but the system would never shut down or return to a login screen, and when a key on the keyboard was pressed, it would make the simple “bonk” noise that indicates input is currently unwelcome. The mouse cursor would still move, but a click would be fruitless.
The system was effectively hung after logout, but before the shutdown task was complete.
It’s important to know the actors in any play, so let’s discuss setup.
My Dad in his retirement enjoys photography, and he takes a lot of pictures. He’s also quite skilled at digital editing, and keeps Adobe Creative Suite handy to do this work. Photoshop, Bridge, Lightroom: these are his stock-in-trade. He has several nice cameras that take ridiculously large images, so he has amassed a collection of external storage, including a Promise RAID, a couple of OWC ThunderBay arrays, and various and sundry external storage volumes that are helpful in keeping good backups.
I learned my backup paranoia, er, sensitivity by watching him apply the same diligence he once used to operate a submarine’s nuclear reactor to keeping his backups in line. He maintains a bootable clone and multiple Time Machine backups, which came in very handy as we began to troubleshoot the problem.
So, we had:
- A 2013 Mac Pro, well equipped with D500 Graphics cards and 64GB of RAM
- 2 x NEC ColorSync displays, connected via Mini DisplayPort and USB for calibration purposes
- 2 x OWC Thunderbay arrays for storage, running with SoftRAID involved for redundancy and/or speed
- 1 x Promise Pegasus R6 array
- 1 x OWC Thunderbolt 2 Dock for additional USB storage
- 2 x secondary USB 3 hubs for connecting peripherals, including some film scanners and some printing tools
There’s a lot of hardware here.
Where Do We Start?
Well, there are the obvious steps, the ones that would surely be on everyone’s list. Suffice it to say, they’re easy, so we tried them first:
1. PRAM Zap
2. SMC Reset
They were entirely fruitless. The machine was heard audibly guffawing as we tried them. Still, I know Apple’s going to want us to do them before we beg for warranty support, so we did them.
Next were removing hardware factors:
3. Disconnect Everything But The Displays, Keyboard and Mouse
I was honestly hopeful this was going to be it, because it would point to some sort of weird hardware combo that led to a race condition at shutdown time or something. It was not.
4. Update All The Things
SoftRAID was a point release or two out of date, so we updated the kernel extension and driver, and tried again.
This was also not it.
5. Reinstall the Operating System
This used to be an awful experience. Thankfully, it’s not anymore. Boot from Recovery, run the installer again, restart, cross fingers, sacrifice goat, dance naked under the light of the full moon.
6. Safe Boot
This was where things took an interesting turn. And actually, we did this part for the first time after step 3, but we then moved on to the more serious methods in search of a permanent solution.
After a Safe Boot, we were able to shut down the machine.
Safe Boot, as Apple helpfully explains, will:
- Load only the required System kexts
- Prevent LaunchDaemons and LaunchAgents from loading
- Disable user-installed Fonts
- Delete Caches for the Kernel and the System, and reset the Font caches
That isolated our problem to one of those four areas. As the /Library/Extensions folder is protected by System Integrity Protection, starting there seemed foolhardy. We opted for the LaunchDaemons and LaunchAgents.
At first, I started by disabling a few that we might be able to live without entirely, leaving some key LDs and LAs in place to keep the environment usable in the event that this solved the problem. That was not helpful. I eventually did what many old-school Mac Admins will remember doing: disabling them all, in the hopes that a clean boot would identify the culprit. Then, at least, you can use split-half testing to identify which of these objects was causing the problem.
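For anyone who hasn't done split-half testing before, here is a minimal sketch of the idea in Python. It assumes a single culprit, and the `fails_with` callback is a stand-in of my own for the manual part: enable only that subset, reboot, try to shut down, and report whether the hang came back. The function and label names are hypothetical, not anything macOS provides.

```python
from typing import Callable, List, Optional, Sequence


def find_culprit(
    items: Sequence[str],
    fails_with: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Bisect `items` down to the one item whose presence triggers the failure.

    `fails_with(subset)` must return True when the problem reproduces with
    only `subset` enabled. Assumes exactly one item is responsible.
    """
    # If the problem does not reproduce with everything enabled,
    # none of these items is the cause.
    if not items or not fails_with(items):
        return None

    candidates: List[str] = list(items)
    while len(candidates) > 1:
        first_half = candidates[: len(candidates) // 2]
        if fails_with(first_half):
            # The failure follows the first half, so the culprit is in it.
            candidates = first_half
        else:
            # Otherwise it must be in the remaining items.
            candidates = candidates[len(first_half):]
    return candidates[0]
```

Called with a list of job labels and a callback that answers "did it hang?", this narrows the field by half on every reboot, so even fifty leftover agents take about six rounds rather than fifty.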
But this wasn’t successful either.
The Fonts definitely weren’t the problem, so we left that part alone.
This was the point at which we began to consider Serious Measures™ to fix the issue.
But First: Why This Method?
Troubleshooting a problem of this difficulty is an effort to balance severity of the solution’s effects on the user’s working environment and the ability to seek out a non-destructive solution. It would be cavalier to just wipe the internal drive and re-stage the machine without knowing what wouldn’t fix it, and it could be destructive to the workflow of the user, so we left that for last. What we opted to try were the solutions I’d call the “quick” fixes:
- Zap the PRAM
- Reset the SMC
- Safe Boot to clear the System and Kernel Caches
These steps can solve tricky problems and they do it in just a few minutes, getting the user back on their feet quickly. These are non-destructive solutions as they only dump resources that are quickly rebuilt by the system in a programmatic way.
What I really wanted in the midst of all this was an equivalent to the verbose boot for the shutdown process, but that process doesn’t exist, and searching through the Console and System log in the 10.12+ era is worse than looking for a needle in a haystack. So I started to look for key identifiers of a potential solution by eliminating variables.
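If you do want to try the haystack anyway, the `log show` tool can at least narrow it down. The sketch below only builds the command; the one-hour window and the case-insensitive "shutdown" predicate are my own guesses at a starting point, not a known-good filter for this problem, and there is no guarantee the hang leaves anything useful in the log.

```python
import shlex
from typing import List


def shutdown_log_command(last: str = "1h") -> List[str]:
    """Build a `log show` invocation that filters the unified log.

    `last` is a time window such as "30m", "1h" or "1d". The predicate is
    an assumed starting point: any message mentioning "shutdown".
    """
    predicate = 'eventMessage CONTAINS[c] "shutdown"'
    return ["log", "show", "--last", last, "--predicate", predicate]


def as_shell(command: List[str]) -> str:
    """Quote the argument list so it can be pasted into Terminal."""
    return " ".join(shlex.quote(arg) for arg in command)
```

Printing `as_shell(shutdown_log_command("30m"))` gives a line you can paste into Terminal after the next boot; from there it is a matter of tightening the predicate until the output is readable.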
Searching for a hardware problem can be a challenge, especially if you have bus conflicts, or related issues due to the large number of USB and Thunderbolt devices in play, so I removed those from the equation early to eliminate a key source of potential interference in the system’s good operation.
With those gone, reducing the number of root-level processes seemed to be the next key target, as user-level events were eliminated by the logout of the user. A quick attempt at culling the ancient remains of programs long out of use, but whose LaunchAgents and LaunchDaemons remained behind, was part of what came next. I was sincerely hoping that there was some obscure abandoned LD or LA that was triggering our failed shutdown, but with those gone, we were left with just one solution.
As a last-ditch effort, we booted from an external clone to test whether or not it might be the internal SSD that was causing the issue. The failure to shut down was independent of the volume that it was booted from, and related purely to what was stored in the OS.
The Solution: Nuke. Pave. Migrate Back.
Yes, this is the sad end to this tale. We were left with a solution that was unappealing in its chance to damage the user’s workflow, but we were out of options. So, we opted for:
- Create a bootable clone of the boot volume
- Back the boot volume up to a Time Machine destination (or two. or three.)
- Boot from Recovery
- Wipe the boot volume
- Reinstall macOS High Sierra from Recovery
- Test the system’s functionality to look for a hardware error.
- Once testing is complete, use Migration Assistant to restore from the bootable clone.
This is what solved our issue.
I suspect we had a rogue or old kext that was protected by SIP, and had we disabled SIP and done split-half testing we might have found it sooner.
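If I were chasing that theory, the first thing I'd want is a list of loaded kexts that don't come from Apple. Here is a small sketch that filters the text output of `kextstat`. It assumes the bundle identifier is the sixth whitespace-separated column (index, refs, address, size, wired, name), which is how the output looked to me on this era of macOS; check it against your own output before trusting it.

```python
from typing import List


def third_party_kexts(kextstat_output: str) -> List[str]:
    """Return bundle identifiers of loaded kexts that are not Apple's.

    Expects the plain-text output of `kextstat`. Assumes the bundle
    identifier is the sixth column of each data row.
    """
    found: List[str] = []
    for line in kextstat_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric index; the header row does not.
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        bundle_id = fields[5]
        if not bundle_id.startswith("com.apple."):
            found.append(bundle_id)
    return found
```

On the Mac itself you would feed it the output of running `kextstat`, for example via `subprocess.run(["kextstat"], capture_output=True, text=True).stdout`. Whatever comes back is the short list of candidates for split-half testing.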
But, this is how I worked the problem, and I hope that if you’re reading this in front of your Mac Pro that won’t shut down, you might take some ideas from this post.
Many thanks to John Lamb, Eric Holtam, Ron Sanders, Owen Pragel, and Graham Gilbert for advice and encouragement!