With the RTM bits already available, more and more customers are planning SharePoint 2016 on-premises deployments, so it’s an ideal time to take a precise look at how our old habits should be challenged by the new version.
Usually this means looking for documentation on the new version and, where it is not yet available, working out whether things have changed significantly from previous versions and extrapolating rules for the new one. So in many respects the findings presented here come with a disclaimer: they hold only until new documentation supersedes them. If that happens, I will update the post accordingly.
Here are my key findings about Distributed Cache with SharePoint 2016:
Before detailing these points, here is some context.
1°) What is the purpose of Distributed Cache?
In SharePoint 2013 (and implicitly in SharePoint 2016) “the microblog features and feeds rely on the Distributed Cache to store data for very fast retrieval across all entities. The Distributed Cache service is built on Windows Server AppFabric, which implements the AppFabric Caching service. Windows Server AppFabric installs with the prerequisites for SharePoint Server 2013.” (from Overview of microblog features, feeds, and the Distributed Cache service in SharePoint Server 2013).
The figure below, taken from that page, gives a good high-level view of how the Distributed Cache (DC) interacts with the Content Databases and feeds. It’s interesting to note that, according to this diagram, writes to the Content Databases occur BEFORE writes to the Distributed Cache, which is consistent with the fact that DC is a cache structure and that we could lose it without losing information (more on this point in §4 on availability).
So the microblog features and feed caches are the main consumers of the Distributed Cache, but they are far from the only ones. The same page gives a complete list of the caches that depend on the Distributed Cache service:
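The write order described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: `FeedStore`, `content_db` and `cache` are hypothetical names, not SharePoint or AppFabric APIs, and plain dictionaries stand in for the Content Database and the cache cluster. It shows why, for items that follow this write order, losing the cache costs performance but not data.

```python
class FeedStore:
    """Toy model: an authoritative store fronted by a volatile cache."""

    def __init__(self):
        self.content_db = {}  # stand-in for the Content Database (authoritative)
        self.cache = {}       # stand-in for the Distributed Cache (volatile)

    def write(self, key, value):
        # 1. Write to the authoritative store first...
        self.content_db[key] = value
        # 2. ...and only then to the cache.
        self.cache[key] = value

    def read(self, key):
        # Serve from the cache when possible.
        if key in self.cache:
            return self.cache[key]
        # Cache miss: fall back to the authoritative store and refill the cache.
        value = self.content_db[key]
        self.cache[key] = value
        return value

    def lose_cache(self):
        # Simulate losing every cache host at once.
        self.cache.clear()


if __name__ == "__main__":
    store = FeedStore()
    store.write("post:1", "Hello feed")
    store.lose_cache()
    print(store.read("post:1"))  # still readable, served from content_db
```

The slower fallback path in `read` is the performance hit discussed in §5.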
- Login Token Cache
- Activity Feed Cache
- Activity Last Modified Time Cache
- OneNote Throttling / “Bouncer Cache”
- Access Cache
- Search Query Web Part
- Security Trimming Cache
- App Access Token Cache
- View State Cache
- Default Cache
To go deeper into caches in SharePoint in general and the Distributed Cache in particular, I highly recommend reading this very good article:
After that, let’s detail the main points:
2°) Changes in architecture between SharePoint 2013 and SharePoint 2016?
On this point, it should be noted that Distributed Cache is not even mentioned in New and improved features in SharePoint Server 2016, meaning that there are no big “external” changes from SharePoint 2013 to 2016. However, looking a little more closely, we can find three interesting articles on SharePoint 2016 Distributed Cache improvements:
- The first is from Bill Baer, Senior Technical Product Manager on SharePoint, who describes in his article Distributed Cache in SharePoint Server 2016 IT Preview how some internal changes to Distributed Cache in SharePoint 2016 improve performance and resiliency:
SharePoint Server 2016 IT Preview improves Distributed Cache performance and resiliency through a change which switches off NTLM authentication between SharePoint and the cache cluster; instead relying on encryption of cache data before transport. … This change also allows SharePoint Server 2016 IT Preview Distributed Cache clusters to scale up the number of client connections to help with throughput.
- The second is from Spencer Harbar, who clearly shows in SharePoint 2016 Nugget #2: Distributed Cache Size in MinRole Farms that memory allocation for the DC service is optimized when you use the “Distributed Cache” role with the new MinRole option. In this case, the memory allocated to the DC service is now half of 80 percent of the total RAM, which is good!
- The third is also from Spencer Harbar, embedded in his very good The Playbook Imperative and Changing the Distributed Cache Service Identity, which details the implications of, and improvements to, changing the Distributed Cache service identity:
Well the good news is that they have refined the service instance plumbing so that the Distributed Cache service instance exists prior to a server being added as a Distributed Cache host. This means we can set the service identity before we ever add a Distributed Cache host. This works both when using MinRole (by specifying a -ServerRole of DistributedCache) or when using -ServerRoleOptional.
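To put numbers on the MinRole sizing rule from the second article above ("half of 80 percent of the total RAM"), here is a minimal arithmetic sketch. The function name is hypothetical, and rounding down to whole megabytes is my assumption, not something the article specifies.

```python
def minrole_cache_size_mb(total_ram_mb):
    """Return half of 80 percent of total RAM, in whole MB (rounded down)."""
    eighty_percent_mb = total_ram_mb * 80 // 100  # 80 percent of total RAM
    return eighty_percent_mb // 2                 # half of that


if __name__ == "__main__":
    for ram_gb in (8, 16):
        print(ram_gb, "GB RAM ->", minrole_cache_size_mb(ram_gb * 1024), "MB")
    # 8 GB RAM -> 3276 MB
    # 16 GB RAM -> 6553 MB
```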
3°) Co-location or dedicated servers?
For a medium to large farm, it’s a no-brainer: you can and should dedicate servers (we will discuss how many below) specifically to the Distributed Cache service. The MinRole option in SharePoint 2016 defines a specific role for dedicated Distributed Cache servers and, as shown above in the Spencer Harbar article, it automatically improves the default configuration.
The advantages of dedicating servers to the Distributed Cache service are:
- better resource utilization
- best performance
- simplified administration and patching
- easier reconfiguration in case of problems and emergencies
For a small farm, however, it’s always harder to decide which option is better. In Capacity planning for the Distributed Cache service (for SharePoint 2013), TechNet puts a clear number on it: above 10,000 users, dedicated servers are highly recommended; below that limit, co-location can be used, especially if you can’t afford the cost of dedicated servers. Here is an extract of the table you can find there:
|Deployment size|Small farm|Medium farm|Large farm|
|Total number of users||||
|Recommended architectural configuration|Dedicated server or co-located on a front-end server|||
|Minimum cache hosts per farm||||
I don’t want to add too much on the co-location topic, as the Internet is already full of discussions about it, but I will point to one good article by a recognized expert that clearly describes the benefits of:
4°) Distributed cache availability
This is probably one of the most poorly understood points about the Distributed Cache!
It is, however, clearly stated in several places, especially in Plan and use the Distributed cache Service in SharePoint Server 2013:
A cache cluster cannot be configured for High Availability
The cache cluster’s cache spans all cache hosts and saves data on each cache host. Data is not duplicated or copied on other cache hosts in the cache cluster
Accessible from here: Manage the Distributed Cache service in SharePoint Server 2013
This is also well described in Part 2 of Josh Gavant’s article mentioned earlier, AppFabric Caching (and SharePoint): Configuration and Deployment (Part 2):
This one is easy – SharePoint (as of March 2013) does not provide any high availability for its caches.
As briefly discussed above, this means that each item and region in SharePoint’s named caches exists only once across all the memory in the cluster. If the server where that item has been stored in memory is lost or shut down ungracefully, that cached item will be lost.
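The "each item exists only once" behaviour quoted above can be illustrated with a toy partitioned cache. This is a sketch under assumptions: the host names are made up, and the CRC-based placement is only an illustrative stand-in, not how AppFabric actually assigns items or regions to hosts.

```python
import zlib


def owner(key, hosts):
    """Pick exactly one host for a key (illustrative hashing only)."""
    return hosts[zlib.crc32(key.encode("utf-8")) % len(hosts)]


def place(keys, hosts):
    """Store every key on a single host: no copies, no replicas."""
    placement = {host: set() for host in hosts}
    for key in keys:
        placement[owner(key, hosts)].add(key)
    return placement


def surviving_keys(placement, lost_host):
    """Keys still cached after one host is lost ungracefully."""
    survivors = set()
    for host, keys in placement.items():
        if host != lost_host:
            survivors |= keys
    return survivors


if __name__ == "__main__":
    hosts = ["cache1", "cache2", "cache3"]
    keys = ["item%d" % i for i in range(300)]
    placement = place(keys, hosts)
    lost = len(keys) - len(surviving_keys(placement, "cache2"))
    print("items lost with cache2:", lost)
```

Whatever was held by the lost host is simply gone from the cache; adding hosts spreads the load and shrinks the share lost per host, but it does not create redundancy.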
An important point here is that although high availability can’t be achieved for this service by design, it’s obviously better to have more than one cache server in your farm (at least to be able to restart the service in case of problems). Two servers might seem like a good answer, but they are not quite enough; three is a better number to recommend to your customers. Please note that this is not a question of cluster quorum since, as already stated, SharePoint does NOT provide high availability for its caches.
In Spencer Harbar’s own words, here is how it can be justified:
Of course if we care about availability (note I said availability, not high availability) we need more than one Distributed Cache server. Actually there really isn’t much point to a “distributed” cache if it’s only on one server. But we really should have three. Yes. Three. Not two. Three. AppFabric, the real software that Distributed Cache provides a wrapper for has a cluster quorum model. This means that three hosts are the minimum optimal configuration. However, SharePoint’s implementation does NOT use this quorum, the ConfigDB holds host information. Never the less, you will get the best performance and reliability from three or more servers. And further if you only have two you will hit issues when attempting to gracefully shut down any single server (if you do that properly using the AppFabric cmdlets and NOT the SharePoint one, which doesn’t work). None of that is important to the playbook for changing the service account, but it is extremely important more generally. http://www.harbar.net/archive/2016/03/21/The-Playbook-Imperative-and-Changing-the-Distributed-Cache-Service-Identity.aspx
5°) What is the impact of losing a dedicated cache server or the whole service itself?
As discussed at the very beginning of this post, this is generally not a problem for cached items because they are authoritatively stored elsewhere. Nevertheless, there are a couple things to keep in mind.
First, retrieving cached items all over again involves a performance hit, the very hit the caches are intended to help avoid. There could be interruptions and delays while the caches are being refilled. For example, if the ActivityFeed cache is lost, users may not see all recent updates in their Newsfeed, or may see the “We’re still gathering the news” message as the cache is repopulated.
For the ActivityFeed and ActivityFeedLMT cache, there are two PowerShell cmdlets to manually begin repopulation of the caches before users actually request data. These are Update-SPRepopulateMicroblogLMTCache and Update-SPRepopulateMicroblogFeedCache. In situations where maintenance leads to loss of these caches, plan to run these cmdlets immediately afterwards to repopulate data manually.
A second concern when cached data in SharePoint is lost is that some items in SharePoint are *only* stored in the cache; specifically, updates regarding followed documents are only stored in the cache (as of March 2013). If these cached items are lost they won’t be able to be regenerated and will no longer appear in users’ feeds.
So here, for the first time, we see that losing a cache that is not highly available can lead to the loss of some information. That’s a bad thing from an architectural point of view, but the impact on the end user is small, so we have to live with it.
6°) Server sizing
No change in this area: a typical SharePoint 2016 server will have 16 GB of RAM, and the Distributed Cache service still does not correctly handle more than 16 GB of RAM:
On a server that has more than 16 GB of total physical memory, allocate a maximum of 16 GB of memory to the Distributed Cache service. If you allocate more than 16 GB of memory to the Distributed Cache service, the server might unexpectedly stop responding for more than 10 seconds.
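As a planning aid, the 16 GB ceiling quoted above can be expressed as a simple clamp. This is a hedged sketch: the function and constant names are hypothetical, and it only checks numbers on paper; it does not configure anything on a server.

```python
MAX_DC_ALLOCATION_MB = 16 * 1024  # ceiling quoted in the guidance above


def clamp_allocation_mb(total_ram_mb, requested_mb):
    """Return (allocation_mb, was_capped) for a planned allocation."""
    if requested_mb > total_ram_mb:
        raise ValueError("requested allocation exceeds physical memory")
    if requested_mb > MAX_DC_ALLOCATION_MB:
        return MAX_DC_ALLOCATION_MB, True
    return requested_mb, False


if __name__ == "__main__":
    # A 32 GB server where someone plans to give 24 GB to the cache service.
    print(clamp_allocation_mb(32 * 1024, 24 * 1024))  # (16384, True)
```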
7°) Implications for administration
It should be recalled here that:
To avoid losing items from the cache and/or having to retrieve them again, you can use the Stop-SPDistributedCacheServiceInstance cmdlet with the -Graceful switch. This will move all cached items from the local cache host to other cache hosts in the cluster. For this to be effective, there must be space on the other servers to accommodate these items. Also note that if shutting down the entire cluster, such as to change the cache host size, there’s no way to avoid losing all of the caches and items. Plan accordingly.
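The condition "there must be space on the other servers" can be checked on paper before a graceful stop. The sketch below is only a back-of-the-envelope model: the function name and the per-host figures are hypothetical, and it assumes the items of the leaving host can be spread freely across the remaining hosts.

```python
def can_drain_gracefully(used_mb, capacity_mb, leaving):
    """True if the leaving host's items fit in the free space of the others.

    used_mb and capacity_mb map host name -> megabytes.
    """
    to_move = used_mb[leaving]
    free_elsewhere = sum(
        capacity_mb[host] - used_mb[host]
        for host in capacity_mb
        if host != leaving
    )
    return to_move <= free_elsewhere


if __name__ == "__main__":
    capacity = {"cache1": 4000, "cache2": 4000, "cache3": 4000}
    used = {"cache1": 3000, "cache2": 1000, "cache3": 1500}
    print(can_drain_gracefully(used, capacity, "cache1"))  # True: 3000 <= 5500

    # With only two hosts the same drain no longer fits.
    print(can_drain_gracefully({"cache1": 3000, "cache2": 2000},
                               {"cache1": 4000, "cache2": 4000},
                               "cache1"))  # False: 3000 > 2000
```

The two-host case is one more reason to prefer three cache hosts, as argued in §4.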
and a very important consideration:
Management of Distributed Cache Service Instances (AppFabric Cache Hosts) in SharePoint is different than management of most SharePoint service instances. … Unlike other service instances, though, the Distributed Cache Service Instance should either be installed *and* online on a SharePoint server, or not installed at all. If the service instance is stopped (disabled) but not uninstalled, details about the associated Cache Host stay in the Cache Cluster Config store, which can cause problems.
Detailed administration is out of the scope of this post, but I can’t finish without pointing you to two VERY good additional resources:
If we didn’t have this playbook, we’d have no good chance of creating the run book, or the scripts to implement the run book. Because we have it we can produce a run book and scripts much more easily as we have our essential details and we won’t waste time thrashing out hacks that semi work or have environment specifics hard wired into them.
- No-nos, Gotchas, Warnings, Best Practices, and Things to Remember for SharePoint 2013 Distributed Cache Service by Nik Patel,
where additional points are also well presented.
Let me know your findings on this topic, or of any new SharePoint 2016 documentation that could have an impact on these points.
I hope this helps you design better SharePoint 2016 farms!