At VMworld 2017 VM and vSAN Encryption and security of vSphere in general became VERY popular topics. And in those discussions the topic of Key Managers came up and specifically “How many key managers should I have?” was a recurring question.
This blog article will give you two examples of key manager topologies and will introduce you to some management concepts. Because every environment comes with unique configurations and requirements, the intent is not to “boil the ocean” but to just get you thinking and help you understand the underlying pieces so you can make more informed decisions.
Availability
The biggest requirement of key management is availability. The analogy I use when talking about this is DNS. Nobody runs with a single DNS server in their environment. (I hope!) You have multiple replicating DNS servers “just in case” something goes wrong. Maybe in your single site you’ll have at least two and maybe three or four. Maybe you even have a DNS server or two running in a cloud or have servers at another site, again, “just in case”. Why? Because if DNS is down then everything is essentially dead in the water. All roads lead to DNS, all roads can end if there’s no DNS.
The same holds true for key management. If the key management infrastructure is down, you can’t encrypt new VM’s or re-key existing VM’s! Even more importantly, you DON’T want that single point of failure. If you have just one KMS and something bad happens to it and you can’t recover the keys then you have some serious issues to attend to! There are no back doors to decrypt a VM. If you lose the keys, you’ve lost the data unless you’ve backed it up. See “Resources” below where James Doyle from our GSS organization did a talk on this at VMworld. A must watch!
Business Critical Infrastructure
This is why Key Management will become the next (critical) datacenter infrastructure requirement, just like DNS and NTP have become. This isn’t a “I need a KMS for vCenter” discussion as much as it is a “I need a KMS for the business” discussion.
After all, you don’t install DNS to make it easier to run just the datacenter. You run DNS because without it the business won’t run. Today you may only need a KMS for vSphere but going forward the business may need it for a whole host of things. Encrypted VM’s might be your first need for a KMS but it won’t be your last.
Note how I didn’t list other requirements like HSM’s (Hardware Security Modules). vSphere is just a KMIP client. These functions are handled by the Key Manager you choose. If you want HSM’s then the Key Manager will talk to them and vSphere will talk to the KMS. Honestly, this lessens complexity while providing you the best choice to meet your needs.
Key Manager Basics
There are a few things to set up in vCenter that are covered in the documentation but first, let’s go over some of the concepts and terms.
KMIP = Key Management Interoperability Protocol. This is the standard where any KMIP client can speak to a KMIP KMS and it should just work. Extensive testing is done each year at the RSA Conference. You can learn more at Wikipedia, which actually has a great writeup on the whole process.
DEK = Data Encryption Key. This is the key that the ESXi host generates when you encrypt a VM. This is used to encrypt the data and is stored, encrypted, in the VMX/VM Advanced settings.
KEK = Key Encryption Key. This is the key from the KMS that encrypted the DEK. A pointer to the KMS Cluster and the KEK key ID are stored in the VMX/VM Advanced settings.
Key Manager Cluster/Alias = In vSphere this is a collection of the FQDN/IP addresses of replicating key managers. The name of the cluster is stored, along with the key ID of the KEK, with each VM. When a VM is powered on, this is what vCenter uses to retrieve the correct KEK to unlock the VM.
Don’t get too wrapped up in the term “cluster”. It’s really a collection of FQDN’s and/or IP addresses that represent a group of replicating Key Managers. That’s it.
Note: Whether you are using VM Encryption or vSAN Encryption, it’s all the same configuration, libraries and crypto. They use the same interfaces to connect to the KMS and the same crypto libraries to encrypt or decrypt.
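To make the DEK/KEK relationship a little more concrete, here is a rough Python sketch of the bookkeeping described above. Everything in it is an assumption for illustration: the `VmCryptoSettings` record, the `kms` dictionary and the key ID `kek-001` are made-up names, and `toy_wrap` is an XOR stand-in, NOT real cryptography and not what vSphere does internally. The point is only to show what is stored with the VM and why losing the KEK means losing the data.

```python
import hashlib
import secrets
from dataclasses import dataclass
from typing import Tuple


def toy_wrap(key: bytes, payload: bytes) -> bytes:
    """XOR a 32-byte payload with a pad derived from key.

    Stand-in for real key wrapping, NOT real cryptography.
    XOR is symmetric, so the same call wraps and unwraps.
    """
    pad = hashlib.sha256(key).digest()
    return bytes(p ^ k for p, k in zip(payload, pad))


@dataclass
class VmCryptoSettings:
    """What travels with the VM (hypothetical field names)."""
    kms_cluster: str    # name of the key manager cluster/alias
    kek_id: str         # ID of the KEK held by the KMS
    wrapped_dek: bytes  # the DEK, encrypted with the KEK


# Hypothetical KMS contents: cluster name -> {key ID -> KEK bytes}
kms = {"KMSCluster": {"kek-001": secrets.token_bytes(32)}}


def encrypt_vm(cluster: str, kek_id: str) -> Tuple[bytes, VmCryptoSettings]:
    """Generate a DEK, wrap it with the KEK, keep only the wrapped copy."""
    dek = secrets.token_bytes(32)
    kek = kms[cluster][kek_id]
    return dek, VmCryptoSettings(cluster, kek_id, toy_wrap(kek, dek))


def unlock_vm(settings: VmCryptoSettings) -> bytes:
    """Look up the KEK by cluster name and key ID, then unwrap the DEK."""
    kek = kms[settings.kms_cluster][settings.kek_id]  # KeyError if the key is gone
    return toy_wrap(kek, settings.wrapped_dek)


dek, settings = encrypt_vm("KMSCluster", "kek-001")
assert unlock_vm(settings) == dek
```

If you delete `kek-001` from the `kms` dictionary and call `unlock_vm` again, it fails with a `KeyError`. There is no other path back to the DEK, which is the “no back doors” point made earlier.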
Key Manager Cluster/Alias
Key Managers today are usually set up in a way that they replicate keys to one another. If I have three instances of a key manager, KMS-A, KMS-B and KMS-C, they replicate the keys between them. If I create a key on KMS-A it will show up in KMS-B & KMS-C at some point. You’ll have to check with your key manager product to see how quickly the replication happens.
Using the example above, in vCenter I would create a key manager cluster/alias. In my example I’ll call it “KMSCluster” and add KMS-A, KMS-B and KMS-C into that KMS Cluster. I would then establish a trust with each of the key managers. There are multiple ways to establish trust and I would suggest you consult the documentation for your key manager on how best to do this.
Because vSphere currently uses a Key Management Interoperability Protocol (KMIP) 1.1 compliant library, any KMIP 1.1 key manager should work. For more information on which solutions have been certified go to VMware’s Hardware Compatibility List and search for the category of “kms”.
Key Manager connection retry
Using the example above, vCenter tries the key managers in the order they are listed, so if KMS-A is unavailable vCenter will try KMS-B (next KMS in order). There are two cases. If KMS-A is up but the KMS service, for some reason, doesn’t respond, then vCenter will wait 60 seconds for a response. If after those 60 seconds there is no response, vCenter will try KMS-B. If KMS-A can’t be reached at all (e.g. no IP connection) then vCenter won’t wait the 60 seconds and will try KMS-B immediately.
The maximum vCenter will wait for a KMS to respond is 60 seconds. This is not something that can be configured.
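Here is a small Python sketch of that retry behavior, just to make the two failure cases easy to see. It is a simulation under my own assumptions, not vCenter’s actual code: the exception names, `fetch_key` and `fake_request` are all hypothetical, and nothing here really sleeps for 60 seconds. It only counts the time that would have been spent waiting.

```python
from typing import Callable, List, Tuple

RESPONSE_TIMEOUT_S = 60  # fixed wait described above


class KmsUnreachable(Exception):
    """No IP connectivity to the KMS; the caller can move on immediately."""


class KmsTimeout(Exception):
    """The KMS host is up but its service did not answer within the timeout."""


def fetch_key(
    servers: List[str],
    key_id: str,
    request: Callable[[str, str, int], bytes],
) -> Tuple[str, bytes, int]:
    """Try each KMS in list order; return (server, key, seconds spent waiting)."""
    waited = 0
    for server in servers:
        try:
            key = request(server, key_id, RESPONSE_TIMEOUT_S)
            return server, key, waited
        except KmsUnreachable:
            continue  # fail fast, no 60-second wait
        except KmsTimeout:
            waited += RESPONSE_TIMEOUT_S  # full timeout elapsed before moving on
    raise RuntimeError(f"no KMS in {servers} returned key {key_id!r}")


# Simulated transport (hypothetical): KMS-A hangs, KMS-B has no network, KMS-C answers.
def fake_request(server: str, key_id: str, timeout_s: int) -> bytes:
    if server == "KMS-A":
        raise KmsTimeout(server)
    if server == "KMS-B":
        raise KmsUnreachable(server)
    return b"kek-bytes-for-" + key_id.encode()


server, key, waited = fetch_key(["KMS-A", "KMS-B", "KMS-C"], "kek-001", fake_request)
print(server, waited)  # KMS-C 60
```

In this run only the hung KMS-A costs 60 seconds. The unreachable KMS-B costs nothing, and the key comes back from KMS-C.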
Separation of duties (when you can)
Ideally, you would want to have a separation of duties when it comes to key management. Large organizations that have a security team may want to control the keys and the IT admin would just use the key manager as a service.
Understandably, not every organization has that ability. For those IT admins that are also the network/storage/security/chief cook/bottle washer types, what you want to consider doing is ensuring only your most trusted admins can administer the key managers and that you have sufficient controls in place to ensure isolation from other network traffic if possible. If you have a management cluster then this might be a good place to run your KMS VM’s, if that’s what you are using. The goal is that only a select few administrators should be able to effect any change on these systems. Off the top of my head, I can’t think of a good business reason for a junior administrator to have any access.
Note about running some of your KMS’s on a management cluster: One thing to keep in mind is to avoid the Catch-22/Chicken&Egg scenario. What I mean by that is that especially with Single Site configurations, you really should have a separate infrastructure for your key management.
For example, running all your key manager VM’s solely on an encrypted vSAN is not ideal. With vSAN Encryption you configure vCenter and then the hosts communicate with the KMS. If you think about it, do you really want the box that holds your key locked in a box that can only be opened by the key inside the inner box? No, you don’t. Will it work? Not after a power cycle. Is it a good practice? No, it’s not. Don’t do this.
You REALLY would need to have at least one of your key managers running in a cloud. Remember, availability, availability, availability. Never willingly paint yourself into a corner. Never have a single point of failure. All your KMS VM’s on an encrypted vSAN that gets its keys from those KMS systems is not good.
The same is true for vCenter and PSC’s in a VM Encryption scenario. You shouldn’t encrypt them using VM Encryption because they would need to be up and running in order to fetch the very key they need to boot up. It’s best to run them in a separate management cluster. THOSE VM’s you COULD run on vSAN encryption, and that is supported, because the hosts themselves communicate with the KMS in a vSAN Encryption model.
Topologies
Single Site
As mentioned above, if you have a single site then you probably want to have, at minimum, two replicating key managers. Ideally, you probably want at least three so that at no time are you relying on a single key manager instance. In some cases, as I pointed out above, you may want to run one of your replicating key managers in a cloud service. Some of the certified key managers we work with allow for that type of capability.
What we’re looking for here is the avoidance of a single point of failure within your site. For example, don’t allow all your KMS VM’s (if that’s what you’re using) to run on the same host. Use DRS rules (VM-Host affinity or VM-VM anti-affinity) to keep them on separate hosts, ensuring that someone tripping over the power cord doesn’t cause an outage. Think through the scenarios for your environment. Come up with disasters that “could” happen and plan accordingly.
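As a quick sanity check on the “not all on the same host” advice, something like the following Python sketch could flag hosts that carry more than one KMS VM. The placement data is entirely made up (`esx01`, `esx02` are hypothetical host names), and in a real environment you would pull the VM-to-host mapping from your own inventory tooling.

```python
from collections import defaultdict
from typing import Dict, List


def hosts_with_multiple_kms(placement: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each host that runs more than one KMS VM to the VMs it runs."""
    by_host: Dict[str, List[str]] = defaultdict(list)
    for vm, host in placement.items():
        by_host[host].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}


# Hypothetical placement: KMS VM name -> ESXi host it currently runs on
placement = {"KMS-A": "esx01", "KMS-B": "esx01", "KMS-C": "esx02"}
print(hosts_with_multiple_kms(placement))  # {'esx01': ['KMS-A', 'KMS-B']}
```

An empty result means every KMS VM sits on its own host, which is what the rules above are meant to enforce.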
Multi Site
For multi-site, things get more interesting. You want all the KMS servers running and replicating across the sites. Within vCenter, you’ve created your KMS Cluster/Alias and you start adding your FQDN’s or IP’s.
KMS Order
But there’s something to take into consideration, and that’s the order in which you add those KMS. vCenter will try the first KMS on the list first and then work down the list in order.
In the scenario below I have two sites, Site-A and Site-B. In Site-A I have KMS-A, KMS-B and KMS-C. In Site-B I have KMS-D, KMS-E and KMS-F.
In Site-A I want the order to go A,B,C,D,E,F
In Site-B I want the order to go D,E,F,A,B,C
Why? As mentioned, if vCenter doesn’t reach that KMS for 60 seconds it will try the next KMS on the list. What you don’t want to set up is a situation whereby vCenter is going to a remote KMS before exhausting all of the local KMS’s.
I don’t want either site to have to get a key from the remote site unless absolutely necessary. In my example I haven’t added in any additional latencies or other network configurations that may affect the transit of the key from one to the other.
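The ordering rule is simple enough to sketch in a few lines of Python. This is only an illustration using the site and KMS names from my example. The `kms_order` helper is hypothetical, and you would still enter the resulting order by hand when adding the key managers to the KMS Cluster in each site’s vCenter.

```python
from typing import Dict, List

RESPONSE_TIMEOUT_S = 60  # per-KMS wait when the service doesn't respond


def kms_order(local_site: str, sites: Dict[str, List[str]]) -> List[str]:
    """Return all key managers with the local site's listed first."""
    ordered = list(sites[local_site])
    for site, servers in sites.items():
        if site != local_site:
            ordered.extend(servers)
    return ordered


sites = {
    "Site-A": ["KMS-A", "KMS-B", "KMS-C"],
    "Site-B": ["KMS-D", "KMS-E", "KMS-F"],
}

print(kms_order("Site-A", sites))  # ['KMS-A', 'KMS-B', 'KMS-C', 'KMS-D', 'KMS-E', 'KMS-F']
print(kms_order("Site-B", sites))  # ['KMS-D', 'KMS-E', 'KMS-F', 'KMS-A', 'KMS-B', 'KMS-C']

# Worst case before a remote KMS is tried: every local KMS is up but hung.
worst_case_local_wait = RESPONSE_TIMEOUT_S * len(sites["Site-A"])
print(worst_case_local_wait)  # 180
```

The last number is worth keeping in mind. With three local key managers that are all up but not answering, it could take three minutes before a remote KMS is even tried.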
What if?
One of our GSS guys, James Doyle, has really jumped on the encryption bandwagon and did a fantastic presentation at VMworld on encryption from a Day 2/disaster strikes standpoint.
As you can imagine, GSS gets a lot of customers calling up that might be in the middle of an IT disaster. With those scenarios in mind, James added a “What if they were using VM Encryption?” to the mix.
Imagine the scenario where you DO have just one KMS system and a bunch of encrypted VM’s and Murphy strikes and you’re left with a bunch of running encrypted VM’s but no KMS and no way to ever get those keys back.
Don’t panic. Whatever you do, do NOT shut things down!
James goes into detail on how you can recover from many of these situations. I can’t recommend this highly enough.
Wrap-Up
I hope this has helped give you a better understanding of KMS topologies and the needs and caveats when configuring them and vCenter. This is not meant to be the be-all/end-all for this topic. You should absolutely discuss with your KMS vendor any specific recommendations they have and incorporate them into your design.
mike