[ #SharePoint2016 ] Planning and architecting Distributed cache Service

As more and more customers plan to deploy SharePoint 2016 on-premises now that the RTM bits are available, it’s an ideal time to look closely at how our old habits should be challenged by the new version.

Usually it’s a matter of finding documentation on the new version or, if it is not yet available, trying to understand whether things have evolved significantly from previous versions and extrapolating rules for the new one. So in many respects the findings presented here come with a disclaimer: they hold only until new documentation supersedes them. If that happens, I will update the post accordingly.

Here are my key findings about Distributed Cache with SharePoint 2016:

Distributed Cache architecture did not change significantly in SharePoint 2016 versus SharePoint 2013
However, Distributed Cache can benefit significantly from the MinRole feature of SharePoint 2016
Up to 10,000 users, you can co-locate the Distributed Cache service on Web Front End servers, even though you would still benefit from dedicated servers below this limit
The Distributed Cache cluster for SharePoint (still) cannot be configured for High Availability
For better availability, three hosts are the minimum optimal configuration
Cache servers (still) do not need more than 16 GB of RAM
Administrators should (still) be careful

Before detailing these points, here is some context.

1°) What is the purpose of Distributed Cache?

In SharePoint 2013 (and implicitly in SharePoint 2016) “the microblog features and feeds rely on the Distributed Cache to store data for very fast retrieval across all entities. The Distributed Cache service is built on Windows Server AppFabric, which implements the AppFabric Caching service. Windows Server AppFabric installs with the prerequisites for SharePoint Server 2013.” (from Overview of microblog features, feeds, and the Distributed Cache service in SharePoint Server 2013).

The figure on that page gives a good high-level view of how DC interacts with the Content DB and feeds. It’s interesting to note that, according to this diagram, writes to the Content Databases occur BEFORE writes to the Distributed Cache, which is consistent with the fact that DC is a cache structure and that we could lose it without losing information (more on this point in §4 on availability).

So the microblog features and feeds are the main features that rely on Distributed Cache. However, they are far from the only ones. On the same page we can find a complete list of the caches that depend on the Distributed Cache service:

Login Token Cache
Activity Feed Cache
Activity Last Modified Time Cache
OneNote Throttling / “Bouncer Cache”
Access Cache
Search Query Web Part
Security Trimming Cache
App Access Token Cache
View State Cache
Default Cache

To gain a deeper understanding of caches in SharePoint in general and Distributed Cache in particular, I highly recommend this very good article:

AppFabric Caching and SharePoint: Concepts and Examples (Part 1) from Josh Gavant / besidethepoint

After that, let’s detail the main points:

2°) Changes in architecture between SharePoint 2013 and SharePoint 2016 ?

On this point, note that Distributed Cache is not even mentioned in New and improved features in SharePoint Server 2016, meaning that there are no big “external” changes from SharePoint 2013 to 2016. However, looking a little more closely, we can find three interesting articles on SharePoint 2016 Distributed Cache improvements:

The first is from Bill Baer, Senior Technical Product Manager on SharePoint, whose article Distributed Cache in SharePoint Server 2016 IT Preview describes how some internal changes to Distributed Cache in SharePoint 2016 improve performance and resiliency:

SharePoint Server 2016 IT Preview improves Distributed Cache performance and resiliency through a change which switches off NTLM authentication between SharePoint and the cache cluster; instead relying on encryption of cache data before transport. … This change also allows SharePoint Server 2016 IT Preview Distributed Cache clusters to scale up the number of client connections to help with throughput.

The second is from Spencer Harbar, who clearly shows in SharePoint 2016 Nugget #2: Distributed Cache Size in MinRole Farms that memory allocation for the DC service is optimized when you use the “Distributed Cache” role with the new MinRole option. In this case, the memory allocated to the DC service is now half of 80 percent of the total RAM, which is good!
The third is also from Spencer Harbar, embedded in his very good The Playbook Imperative and Changing the Distributed Cache Service Identity, which details the implications of, and improvements to, changing the DCS identity:

Well the good news is that they have refined the service instance plumbing so that the Distributed Cache service instance exists prior to a server being added as a Distributed Cache host. This means we can set the service identity before we ever add a Distributed Cache host. This works both when using MinRole (by specifying a -ServerRole of DistributedCache) or when using -ServerRoleOptional.
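To make the MinRole sizing rule from the second article concrete, here is a small Python sketch of the arithmetic exactly as it is described above (half of 80 percent of total RAM). The function name and the sample RAM sizes are my own assumptions for illustration; the actual value is set by SharePoint itself, so treat this only as a way to estimate what to expect.

```python
def minrole_cache_size_mb(total_ram_mb: int) -> int:
    """Cache size under the rule described above: half of 80 percent of total RAM.

    Hypothetical helper for illustration only; this is not a SharePoint API.
    """
    if total_ram_mb <= 0:
        raise ValueError("total_ram_mb must be positive")
    # Integer arithmetic: take 80 percent of RAM, then half of that.
    return (total_ram_mb * 80 // 100) // 2


if __name__ == "__main__":
    for ram_gb in (8, 16, 32):  # sample server sizes, chosen arbitrarily
        size_mb = minrole_cache_size_mb(ram_gb * 1024)
        print(f"{ram_gb} GB RAM -> about {size_mb} MB for the cache")
```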

3°) Collocation or Dedicated servers?

For a medium to large farm, it’s a no-brainer: you can and should dedicate servers (we will discuss how many below) specifically to the Distributed Cache service. The MinRole option in SharePoint 2016 defines a specific role for dedicated Distributed Cache servers and, as shown above in the Spencer Harbar article, it automatically improves the default configuration.

So the advantages of dedicating servers to the Distributed Cache service are:

better resource utilization
best performance
simplified administration and patching
easier reconfiguration in case of problems and emergency

But for a small farm, it’s always difficult to decide which option is better. In Capacity planning for the Distributed Cache service (for SharePoint 2013), TechNet clearly puts the threshold at 10,000 users: above it, dedicated servers are highly recommended; below it, co-location can be used, especially if you can’t afford the cost of dedicated servers. Here is an extract of the table you can find there:

Deployment size	Small farm	Medium farm	Large farm
Total number of users	< 10,000	< 100,000	< 500,000
Recommended architectural configuration	Dedicated server or co-located on a front-end server	Dedicated server	Dedicated server
Minimum cache hosts per farm	1	1	2
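The table above can be read as a simple lookup on the number of users. Here is a minimal Python sketch of that lookup; the class and function names are hypothetical, and the rows are transcribed from the extract above without adding anything beyond it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheRecommendation:
    farm_size: str
    configuration: str
    min_cache_hosts: int


# Rows transcribed from the table above: (exclusive user limit, recommendation).
_TABLE = [
    (10_000, CacheRecommendation(
        "Small farm", "Dedicated server or co-located on a front-end server", 1)),
    (100_000, CacheRecommendation("Medium farm", "Dedicated server", 1)),
    (500_000, CacheRecommendation("Large farm", "Dedicated server", 2)),
]


def recommend(total_users: int) -> CacheRecommendation:
    """Return the first table row whose user limit is above total_users."""
    if total_users < 0:
        raise ValueError("total_users must be non-negative")
    for limit, row in _TABLE:
        if total_users < limit:
            return row
    # The table stops at 500,000 users, so nothing is claimed beyond that.
    raise ValueError("user count is outside the range covered by the table")


if __name__ == "__main__":
    print(recommend(5_000))
    print(recommend(250_000))
```

Keep in mind that the host counts here are the documented minimums; the availability discussion in §4 argues for three hosts.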

I don’t want to add too much on this co-location topic, as the Internet is already full of discussions about it, but I will just point to one good article by a recognized expert that clearly describes the benefits:

Dedicating Servers to Distributed Cache in SharePoint 2013 by Steve Peschka

4°) Distributed cache availability

This is probably one of the most poorly understood points about Distributed Cache!

It is, however, clearly stated in several places, in particular in Plan and use the Distributed cache Service in SharePoint Server 2013:

A cache cluster cannot be configured for High Availability

The cache cluster’s cache spans all cache hosts and saves data on each cache host. Data is not duplicated or copied on other cache hosts in the cache cluster

Accessible from here: Manage the Distributed Cache service in SharePoint Server 2013

This is also well described in Part 2 of the Josh Gavant article already mentioned: AppFabric Caching (and SharePoint): Configuration and Deployment (Part 2)

This one is easy – SharePoint (as of March 2013) does not provide any high availability for its caches.

As briefly discussed above, this means that each item and region in SharePoint’s named caches exists only once across all the memory in the cluster. If the server where that item has been stored in memory is lost or shut down ungracefully, that cached item will be lost.
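To illustrate what “exists only once” means in practice, here is a toy Python model, not AppFabric’s actual placement algorithm: each key is stored on exactly one host, chosen by a CRC32 hash (my assumption, used only to keep the example deterministic), and an ungraceful loss of one host takes its items with it. The host names and key names are hypothetical.

```python
import zlib


def host_for(key: str, hosts: list[str]) -> str:
    """Pick exactly one host per key (toy placement, deterministic via CRC32)."""
    return hosts[zlib.crc32(key.encode("utf-8")) % len(hosts)]


def place(keys: list[str], hosts: list[str]) -> dict[str, set[str]]:
    """Store each key on a single host, with no copy anywhere else."""
    placement: dict[str, set[str]] = {h: set() for h in hosts}
    for key in keys:
        placement[host_for(key, hosts)].add(key)
    return placement


def lose_host(
    placement: dict[str, set[str]], lost: str
) -> tuple[set[str], dict[str, set[str]]]:
    """Ungraceful loss of one host: return (lost items, surviving placement)."""
    survivors = {h: items for h, items in placement.items() if h != lost}
    return placement[lost], survivors


if __name__ == "__main__":
    hosts = ["cache1", "cache2", "cache3"]  # hypothetical host names
    keys = [f"feed-item-{i}" for i in range(300)]
    placement = place(keys, hosts)
    lost, survivors = lose_host(placement, "cache2")
    still_cached = sum(len(items) for items in survivors.values())
    print(f"{len(lost)} of {len(keys)} items lost, {still_cached} still cached")
```

Because no item has a second copy, everything that lived on the lost host has to be rebuilt from its authoritative source, where one exists.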

An important point here is that although high availability can’t be achieved for this service by design, it’s obviously better to have more than one cache server in your farm (at least to be able to restart the service in case of problems). Two servers might seem a good answer, but not really: three is a better number to recommend to your customers. Please note that this is not at all a question of cluster quorum, as we already stated that SharePoint does NOT provide high availability for its own cache.

In Spencer Harbar’s own words, here is how it can be justified:

Of course if we care about availability (note I said availability, not high availability) we need more than one Distributed Cache server. Actually there really isn’t much point to a “distributed” cache if it’s only on one server. But we really should have three. Yes. Three. Not two. Three. AppFabric, the real software that Distributed Cache provides a wrapper for has a cluster quorum model. This means that three hosts are the minimum optimal configuration. However, SharePoint’s implementation does NOT use this quorum, the ConfigDB holds host information. Never the less, you will get the best performance and reliability from three or more servers. And further if you only have two you will hit issues when attempting to gracefully shut down any single server (if you do that properly using the AppFabric cmdlets and NOT the SharePoint one, which doesn’t work). None of that is important to the playbook for changing the service account, but it is extremely important more generally. http://www.harbar.net/archive/2016/03/21/The-Playbook-Imperative-and-Changing-the-Distributed-Cache-Service-Identity.aspx

5°) What is the impact of losing a Dedicated cache server or the whole service itself?

As discussed at the very beginning of this post, this is generally not a problem for cached items because they are authoritatively stored elsewhere. Nevertheless, there are a couple things to keep in mind.

First, retrieving cached items all over again involves a performance hit, the very hit the caches are intended to help avoid. There could be interruptions and delays while the caches are being refilled. For example, if the ActivityFeed cache is lost, users may not see all recent updates in their Newsfeed, or may see the “We’re still gathering the news” message as the cache is repopulated.

For the ActivityFeed and ActivityFeedLMT cache, there are two PowerShell cmdlets to manually begin repopulation of the caches before users actually request data. These are Update-SPRepopulateMicroblogLMTCache and Update-SPRepopulateMicroblogFeedCache. In situations where maintenance leads to loss of these caches, plan to run these cmdlets immediately afterwards to repopulate data manually.

A second concern when cached data in SharePoint is lost is that some items in SharePoint are *only* stored in the cache; specifically, updates regarding followed documents are only stored in the cache (as of March 2013). If these cached items are lost they won’t be able to be regenerated and will no longer appear in users’ feeds.

Josh Gavant on AppFabric Caching (and SharePoint): Configuration and Deployment (Part 2)

So here, for the first time, we see that losing a cache that is not highly available can lead to losing some information. That’s a bad thing from an architectural point of view, but the impact on end users is limited, so we have to live with it.

6°) Server sizing

No change in this area: a typical SharePoint 2016 server will have 16 GB of RAM, and the Distributed Cache service will not correctly handle more than 16 GB of RAM.

On a server that has more than 16 GB of total physical memory, allocate a maximum of 16 GB of memory to the Distributed Cache service. If you allocate more than 16 GB of memory to the Distributed Cache service, the server might unexpectedly stop responding for more than 10 seconds.

Planning for the Distributed Cache service
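As a quick sanity check when sizing larger servers, the 16 GB ceiling quoted above can be applied as a simple clamp. This is a minimal Python sketch; the constant and function names are hypothetical and the requested values are arbitrary examples.

```python
MAX_CACHE_MB = 16 * 1024  # 16 GB ceiling from the guidance quoted above


def clamp_cache_size_mb(requested_mb: int) -> int:
    """Return the requested allocation, limited to the 16 GB ceiling."""
    if requested_mb <= 0:
        raise ValueError("requested_mb must be positive")
    return min(requested_mb, MAX_CACHE_MB)


if __name__ == "__main__":
    for requested in (8_000, 26_214):  # arbitrary sample requests, in MB
        print(f"requested {requested} MB -> allocate {clamp_cache_size_mb(requested)} MB")
```

For example, on a 64 GB server the “half of 80 percent” rule mentioned in §2 would give roughly 26 GB, which this check brings back down to 16,384 MB. The result is the kind of value you would then pass to Update-SPDistributedCacheSize -CacheSizeInMB.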

7°) Implication on the administration

It is worth recalling that:

To avoid losing items from the cache and/or having to retrieve them again, you can use the Stop-SPDistributedCacheServiceInstance cmdlet with the -Graceful switch. This will move all cached items from the local cache host to other cache hosts in the cluster. For this to be effective, there must be space on the other servers to accommodate these items. Also note that if shutting down the entire cluster, such as to change the cache host size, there’s no way to avoid losing all of the caches and items. Plan accordingly.

Josh Gavant, AppFabric Caching (and SharePoint): Configuration and Deployment (Part 2)

and a very important consideration:

Management of Distributed Cache Service Instances (AppFabric Cache Hosts) in SharePoint is different than management of most SharePoint service instances. … Unlike other service instances, though, the Distributed Cache Service Instance should either be installed *and* online on a SharePoint server, or not installed at all. If the service instance is stopped (disabled) but not uninstalled, details about the associated Cache Host stay in the Cache Cluster Config store, which can cause problems.

Josh Gavant, AppFabric Caching (and SharePoint): Configuration and Deployment (Part 2)

As far as administration is concerned, it’s outside the scope of this post, but I can’t finish without pointing you to two VERY good additional resources:

The Playbook Imperative and Changing the Distributed Cache Service Identity from Spencer Harbar

If we didn’t have this playbook, we’d have no good chance of creating the run book, or the scripts to implement the run book. Because we have it we can produce a run book and scripts much more easily as we have our essential details and we won’t waste time thrashing out hacks that semi work or have environment specifics hard wired into them.

And

No-nos, Gotchas, Warnings, Best Practices, and Things to Remember for SharePoint 2013 Distributed Cache Service by Nik Patel.

where additional points are also well presented.

Let me know your findings on this topic, or any newly published SharePoint 2016 documentation that could have an impact on these points.

Hope this will help you to better design your SharePoint 2016 farms!

5 thoughts on “[ #SharePoint2016 ] Planning and architecting Distributed cache Service”

Pierre says:

May 9, 2016 at 3:57 pm

Hi Patrick,

Regarding 5), as experience has shown, running the SP-Repopulate cmdlets has no effect.
This was demonstrated to MS during a (large) project.

Regarding 7), the SharePoint cmdlet Stop-SPDistributedCacheServiceInstance -Graceful should not be used.
This was officially acknowledged in this blog:
https://blogs.msdn.microsoft.com/sambetts/2015/03/12/graceful-sharepoint-appfabric-restarts/

“This time we’ll run the graceful shutdown before rebooting, first this command (changing the hostname of course):
Stop-CacheHost -HostName sp15-search-idx.sfb-testnet.local -CachePort 22233 -Graceful
Now I bet that surprised you; in the official SharePoint/AppFabric documentation we’re told to run “Stop-SPDistributedCacheServiceInstance -Graceful”.
However for reasons too complicated to go into here, let’s just say for now that the official stop command is far from graceful – the service is in fact dropped like a hot potato and anything on that host in AppFabric goes with it.”

“I know some may be asking why the official guidelines on graceful shutdowns don’t work by default; let’s just say we’re looking at it.”

Sam Betts explains that the subject is complicated and promises a post about it “another day”, but there is still nothing more than a year later…

Say hello to Gilles for me.


- Patrick Guimonet [MVP - MS RD] says:
  
  June 19, 2016 at 5:16 pm
  
  Thanks, Pierre, for these clarifications.
  Here is where the updated documentation can be found:
  Perform a graceful shutdown of the Distributed Cache service by using a Windows PowerShell script
  https://technet.microsoft.com/en-us/library/jj219613.aspx#graceful
  It contains a call to Stop-CacheHost -Graceful followed by a wait loop until the service reaches the “down” state.
  
  
Filip Bosmans says:

May 26, 2016 at 4:53 pm

Number 7 still says to use the default SharePoint command with -Graceful, yet the TechNet article was updated with a script that uses Stop-CacheHost.
https://technet.microsoft.com/en-us/library/jj219613.aspx#graceful
Sam Betts mentions why the original one should not be used anymore:
https://blogs.msdn.microsoft.com/sambetts/2015/03/12/graceful-sharepoint-appfabric-restarts/

Some more info and a script to check the entire cluster can be found here:
https://blogs.technet.microsoft.com/filipbosmans/2015/12/07/troubleshooting-distributed-cache-for-sharepoint-2013-on-premise/
It also contains links to best practices and explanations of what to do.


- Patrick Guimonet [MVP - MS RD] says:
  
  June 19, 2016 at 5:20 pm
  
  Yes, you are right, Filip. Thanks for the comment and the links.
  