Dell servers, warranty facts and refresh-mcollective-metadata

On our physical Dell servers we install the dell-omsa packages which give us the ability to monitor our underlying hardware.

With that in place, you can use facter to report on all sorts of useful things about the hardware, including the state of the warranty.

The fact which checks warranty information, uses dell-omsa to pull the service tag of the server and submits it to Dells API – which then returns info about the status of your warranty.

You can then use mcollective to report on this. This can be really useful if you can’t remember what you bought when!

Unfortunately, from time to time it breaks and we start getting cronjob output which looks like this:

/usr/libexec/mcollective/refresh-mcollective-metadata
Could not retrieve fact='warranty_end', resolution='': undefined method `[]' for nil:NilClass
Could not retrieve fact='warranty_days_left', resolution='': can't dup NilClass
Could not retrieve fact='warranty_start', resolution='': undefined method `[]' for nil:NilClass
Could not retrieve fact='warranty_end', resolution='': undefined method `[]' for nil:NilClass

This happens just frequently enough to be a familiar problem for us, but not frequently enough for the fix to stick in my mind!

Googling for the error messages yeilds a couple of mailinglist threads asking about this error and how to work around it – which were both started by my colleague Jonathan Gazeley the first time we hit the problem. [1]

There are no actual fixes in those threads, although one post did hint at the root cause being mcollective caching the result of the Dell API call – without actually stating where it gets cached.

So, it’s strace time!

sudo strace -e open /usr/libexec/mcollective/refresh-mcollective-metadata 2>&1 | less

Skip to the end, and page back until you get to the bit where it starts complaining about the warranty fact, and you find that it’s trying to open /var/tmp/dell-warranty-XXXXXXX.json where XXXXXXX is the service tag of the hardware.

...
open("/var/tmp/dell-warranty-XXXXXXX.json", O_RDONLY) = 3
Could not retrieve fact='warranty_end', resolution='': undefined method `[]' for nil:NilClass
Could not retrieve fact='warranty_days_left', resolution='': can't dup NilClass
...

In our most recent case, the contents of that file looked like this:

$ cat /var/tmp/dell-warranty-XXXXXXX.json 
{
  "GetAssetWarrantyResponse": {
    "GetAssetWarrantyResult": {
      "Response": null,
      "Faults": {
        "FaultException": {
          "Message": "The tag you sent is not present. Check your separator character and ensure it is |.",
          "Code": 4001
        }
      }
    }
  }
}

That looks a lot to me like the API call failed for some reason.

The fix is to remove that stale cache file and re-run the mcollective-refresh-metadata script.

$ sudo rm /var/tmp/dell-warranty-*.json
$ sudo /usr/libexec/mcollective/refresh-mcollective-metadata

Then inspect the cached file again. It should now contain a lot of warranty info.

If it doesn’t, well… then you need to start working out why, and that’s an exercise left for the reader!

-Paul
[1] https://groups.google.com/forum/#!msg/puppet-users/LsK3HbEBMGc/-DSIOMNCDzIJ

I freely admit that the intent behind this post is mostly about getting the “fix” into those google search results – so I don’t have to resort to strace next time it happens!