[ #SharePoint2016 ] Planning and architecting Distributed cache Service

As more and more customers are planning to deploy SharePoint 2016 on-premises infrastructures now that the RTM bits are available, it’s an ideal time to take a precise look at how our old habits should be challenged by the new version.

Usually it’s a matter of finding documentation on the new version or, if it is not yet available, trying to understand whether things have evolved significantly from previous versions and extrapolating rules for the new version. So in many respects the findings presented here come with a disclaimer: they hold true only until new documentation supersedes them. If that happens, I will update the post accordingly.

Here are my key findings about Distributed Cache with SharePoint 2016:

    1. The Distributed Cache architecture did not change significantly in SharePoint 2016 compared with SharePoint 2013.
    2. However, Distributed Cache can benefit significantly from the MinRole feature of SharePoint 2016.
    3. Up to 10,000 users, you can co-locate the Distributed Cache service on Web Front End servers, although you will still benefit from dedicated servers below this limit.
    4. The Distributed Cache cluster for SharePoint (still) cannot be configured for High Availability.
    5. For better availability, three hosts are the minimum optimal configuration.
    6. Cache servers (still) do not need more than 16 GB of RAM.
    7. Administrators should (still) be careful.

 

Before detailing these points, here is some context.

1°) What is the purpose of Distributed Cache?

In SharePoint 2013 (and implicitly in SharePoint 2016) “the microblog features and feeds rely on the Distributed Cache to store data for very fast retrieval across all entities. The Distributed Cache service is built on Windows Server AppFabric, which implements the AppFabric Caching service. Windows Server AppFabric installs with the prerequisites for SharePoint Server 2013.” (from Overview of microblog features, feeds, and the Distributed Cache service in SharePoint Server 2013).

The figure below, taken from that page, gives a good high-level view of how Distributed Cache (DC) interacts with the Content DB and feeds. It is interesting to note that, according to this diagram, writes to the Content Databases occur BEFORE writes to the Distributed Cache. This is consistent with the fact that DC is a cache structure that we could lose without losing information (more on this point in §4 on availability).

Architecture of microblog and dependencies

 

So the microblog features and the feeds cache are the main features that rely on Distributed Cache. However, they are by no means the only ones. The same page gives a complete list of the caches that depend on the Distributed Cache service:

    • Login Token Cache
    • Activity Feed Cache
    • Activity Last Modified Time Cache
    • OneNote Throttling / “Bouncer Cache”
    • Access Cache
    • Search Query Web Part
    • Security Trimming Cache
    • App Access Token Cache
    • View State Cache
    • Default Cache

To go deeper into caches in SharePoint in general and Distributed Cache in particular, I highly recommend reading this very good article:

After that, let’s detail the main points:

2°) Changes in architecture between SharePoint 2013 and SharePoint 2016?

On this point, it should be noted that Distributed Cache is not even mentioned on the page New and improved features in SharePoint Server 2016, which suggests that there are no big “external” changes from SharePoint 2013 to 2016. However, looking a little more closely, we can find three interesting articles on SharePoint 2016 Distributed Cache improvements:

  • The first one is from Bill Baer, Senior Technical Product Manager on SharePoint, who describes in his article Distributed Cache in SharePoint Server 2016 IT Preview how some internal changes to Distributed Cache in SharePoint 2016 improve performance and resiliency:

SharePoint Server 2016 IT Preview improves Distributed Cache performance and resiliency through a change which switches off NTLM authentication between SharePoint and the cache cluster; instead relying on encryption of cache data before transport.  … This change also allows SharePoint Server 2016 IT Preview Distributed Cache clusters to scale up the number of client connections to help with throughput.


Well the good news is that they have refined the service instance plumbing so that the Distributed Cache service instance exists prior to a server being added as a Distributed Cache host. This means we can set the service identity before we ever add a Distributed Cache host. This works both when using MinRole (by specifying a -ServerRole of DistributedCache) or when using  -ServerRoleOptional.

3°) Collocation or Dedicated servers?

For a medium to large farm, it’s a no-brainer: you can and should dedicate servers (we will discuss below how many) specifically to the Distributed Cache service. The MinRole option in SharePoint 2016 defines a specific role for dedicated Distributed Cache servers and, as shown above in the Spencer Harbar article, it automatically improves the default configuration.

So the advantages of dedicating servers to the Distributed Cache service are:

    • better resource utilization
    • best performance
    • simplified administration and patching
    • easier reconfiguration in case of problems and emergency

But for a small farm, it is always difficult to decide which option is better. In Capacity planning for the Distributed Cache service (for SharePoint 2013), TechNet clearly puts the threshold at 10,000 users: above it, dedicated servers are highly recommended; below it, co-location can be used, especially if you can’t afford the cost of dedicated servers. Here is an extract of the table you can find there:

    Deployment size                         | Small farm                                           | Medium farm      | Large farm
    Total number of users                   | < 10,000                                             | < 100,000        | < 500,000
    Recommended architectural configuration | Dedicated server or co-located on a front-end server | Dedicated server | Dedicated server
    Minimum cache hosts per farm            | 1                                                    | 1                | 2
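To make the table easier to apply, here is a minimal Python sketch that maps a user count to the corresponding row. The function and class names are hypothetical, and the thresholds are simply the ones quoted in the table above, so treat it as a reading aid rather than an official sizing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheRecommendation:
    farm_size: str
    dedicated_required: bool  # False: co-location on a front-end server is acceptable
    min_cache_hosts: int


def recommend_cache_topology(total_users: int) -> CacheRecommendation:
    """Return the row of the quoted TechNet table that matches a user count."""
    if total_users < 0:
        raise ValueError("total_users must be non-negative")
    if total_users < 10_000:
        return CacheRecommendation("small", False, 1)
    if total_users < 100_000:
        return CacheRecommendation("medium", True, 1)
    if total_users < 500_000:
        return CacheRecommendation("large", True, 2)
    # The extract of the table stops at 500,000 users.
    raise ValueError("user count is beyond the sizes covered by the table")


print(recommend_cache_topology(8_000))
print(recommend_cache_topology(250_000))
```

Note that these are the minimum host counts from the table; section 4 explains why three hosts are a better target when availability matters.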

I don’t want to add too much on this co-location topic, as the Internet is already full of discussions about it, but I will just point to one good article by a recognized expert that clearly describes the benefits of:

     

4°) Distributed cache availability

This is probably one of the most poorly understood points about Distributed Cache!

It is, however, clearly stated in several places, especially on the page Plan and use the Distributed cache Service in SharePoint Server 2013:

A cache cluster cannot be configured for High Availability

The cache cluster’s cache spans all cache hosts and saves data on each cache host. Data is not duplicated or copied on other cache hosts in the cache cluster


Accessible from here: Manage the Distributed Cache service in SharePoint Server 2013

This is also well described in Part 2 of the Josh Gavant article already mentioned: AppFabric Caching (and SharePoint): Configuration and Deployment (Part 2)

This one is easy – SharePoint (as of March 2013) does not provide any high availability for its caches.

As briefly discussed above, this means that each item and region in SharePoint’s named caches exists only once across all the memory in the cluster. If the server where that item has been stored in memory is lost or shut down ungracefully, that cached item will be lost.

An important point here is that although high availability can’t be achieved for this service by design, it’s obviously better to have more than one cache server in your farm (at least to be able to restart the service in case of problems). Two servers might seem like a good answer, but they are not really enough: three is a better number to recommend to your customers. Please note that this is not at all a question of cluster quorum, as we have already stated that SharePoint does NOT provide high availability for its own cache.

In Spencer Harbar’s own words, here is how it can be justified:

Of course if we care about availability (note I said availability, not high availability) we need more than one Distributed Cache server. Actually there really isn’t much point to a “distributed” cache if it’s only on one server. But we really should have three. Yes. Three. Not two. Three. AppFabric, the real software that Distributed Cache provides a wrapper for has a cluster quorum model. This means that three hosts are the minimum optimal configuration. However, SharePoint’s implementation does NOT use this quorum, the ConfigDB holds host information. Never the less, you will get the best performance and reliability from three or more servers. And further if you only have two you will hit issues when attempting to gracefully shut down any single server (if you do that properly using the AppFabric cmdlets and NOT the SharePoint one, which doesn’t work). None of that is important to the playbook for changing the service account, but it is extremely important more generally.

http://www.harbar.net/archive/2016/03/21/The-Playbook-Imperative-and-Changing-the-Distributed-Cache-Service-Identity.aspx
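To make the “each item exists only once” point concrete, here is a minimal Python sketch of a partitioned cache without replication. The host names, the item keys and the hash-based placement are all assumptions chosen for illustration; this is not how AppFabric actually places items, only a model of the consequence described above.

```python
import hashlib


def place(key: str, hosts: list[str]) -> str:
    """Pick exactly one host for a key (illustrative placement, not AppFabric's)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return hosts[int.from_bytes(digest[:8], "big") % len(hosts)]


def build_cluster(keys: list[str], hosts: list[str]) -> dict[str, set[str]]:
    """Store each key on a single host: no copy is kept anywhere else."""
    cluster: dict[str, set[str]] = {host: set() for host in hosts}
    for key in keys:
        cluster[place(key, hosts)].add(key)
    return cluster


def lose_host(cluster: dict[str, set[str]], failed: str):
    """Simulate an ungraceful loss: return (surviving cluster, lost keys)."""
    lost = set(cluster[failed])
    surviving = {host: items for host, items in cluster.items() if host != failed}
    return surviving, lost


hosts = ["cache1", "cache2", "cache3"]          # hypothetical host names
keys = [f"feed-item-{i}" for i in range(3000)]  # hypothetical cached items
cluster = build_cluster(keys, hosts)
surviving, lost = lose_host(cluster, "cache2")
remaining = set().union(*surviving.values())
print(f"lost {len(lost)} of {len(keys)} items, {len(remaining)} still cached")
```

Whatever the placement, the items held by the failed host are gone until they are fetched again from their authoritative store; adding hosts only reduces the share lost per failure, it never brings replication.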

 

5°) What is the impact of losing a dedicated cache server or the whole service itself?

 

As discussed at the very beginning of this post, this is generally not a problem for cached items because they are authoritatively stored elsewhere. Nevertheless, there are a couple things to keep in mind.

First, retrieving cached items all over again involves a performance hit, the very hit the caches are intended to help avoid. There could be interruptions and delays while the caches are being refilled. For example, if the ActivityFeed cache is lost, users may not see all recent updates in their Newsfeed, or may see the “We’re still gathering the news” message as the cache is repopulated.

For the ActivityFeed and ActivityFeedLMT cache, there are two PowerShell cmdlets to manually begin repopulation of the caches before users actually request data. These are Update-SPRepopulateMicroblogLMTCache and Update-SPRepopulateMicroblogFeedCache. In situations where maintenance leads to loss of these caches, plan to run these cmdlets immediately afterwards to repopulate data manually.

A second concern when cached data in SharePoint is lost is that some items in SharePoint are *only* stored in the cache; specifically, updates regarding followed documents are only stored in the cache (as of March 2013). If these cached items are lost they won’t be able to be regenerated and will no longer appear in users’ feeds.

Josh Gavant on AppFabric Caching (and SharePoint): Configuration and Deployment (Part 2)

So here, for the first time, we see that losing a cache that is not highly available can lead to losing some information. That’s a bad thing from an architectural point of view, but it has little impact on the end user, so we have to live with it.

6°) Server sizing

 

No change in this area: a typical SharePoint 2016 server will have 16 GB of RAM, and the Distributed Cache service will not correctly handle more than 16 GB of RAM.

On a server that has more than 16 GB of total physical memory, allocate a maximum of 16 GB of memory to the Distributed Cache service. If you allocate more than 16 GB of memory to the Distributed Cache service, the server might unexpectedly stop responding for more than 10 seconds. 

Planning for the Distributed Cache service
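As a rough illustration of this sizing rule, the following Python sketch computes a cache size for a dedicated host. Only the 16 GB ceiling comes from the quote above. The 2 GB reserve for other processes and the halving of the remainder (the other half being left for memory-management overhead) are my recollection of the TechNet sizing procedure, and applying the ceiling before the halving is one possible interpretation, so treat all three as assumptions. The function names are hypothetical.

```python
MAX_SERVICE_ALLOCATION_GB = 16.0  # ceiling stated in the quoted guidance
DEFAULT_RESERVED_GB = 2.0         # assumption: headroom for other processes


def service_allocation_gb(total_ram_gb: float,
                          reserved_gb: float = DEFAULT_RESERVED_GB) -> float:
    """Memory handed to the Distributed Cache service on a dedicated host."""
    if total_ram_gb <= reserved_gb:
        raise ValueError("not enough memory left after the reserve")
    return min(total_ram_gb - reserved_gb, MAX_SERVICE_ALLOCATION_GB)


def cache_size_mb(total_ram_gb: float,
                  reserved_gb: float = DEFAULT_RESERVED_GB) -> int:
    """Usable cache size in MB: assumed to be half of the service allocation."""
    return int(service_allocation_gb(total_ram_gb, reserved_gb) / 2 * 1024)


for ram in (8, 16, 64):
    print(f"{ram} GB host -> cache size {cache_size_mb(ram)} MB")
```

The resulting value is the kind of number you would pass to Update-SPDistributedCacheSize -CacheSizeInMB, after checking it against the current official guidance.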

7°) Implications for administration

It should be recalled here that:

To avoid losing items from the cache and/or having to retrieve them again, you can use the Stop-SPDistributedCacheServiceInstance cmdlet with the -Graceful switch. This will move all cached items from the local cache host to other cache hosts in the cluster. For this to be effective, there must be space on the other servers to accommodate these items. Also note that if shutting down the entire cluster, such as to change the cache host size, there’s no way to avoid losing all of the caches and items. Plan accordingly.

Josh Gavant, AppFabric Caching (and SharePoint): Configuration and Deployment (Part 2)
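The capacity condition in this quote (“there must be space on the other servers”) can be sketched in a few lines of Python. The host names and figures are invented for illustration, and the check only compares aggregate free space, which is a simplification of whatever the cache cluster really does when it moves items.

```python
def drainable_hosts(used_mb: dict[str, int], capacity_mb: dict[str, int]) -> dict[str, bool]:
    """For each host, tell whether the other hosts have room for its cached items."""
    result = {}
    for host, used in used_mb.items():
        free_elsewhere = sum(
            capacity_mb[other] - used_mb[other]
            for other in used_mb
            if other != host
        )
        result[host] = free_elsewhere >= used
    return result


# Hypothetical two-host cluster, fairly full: neither host can be drained.
two_hosts = drainable_hosts(
    used_mb={"cache1": 5000, "cache2": 5000},
    capacity_mb={"cache1": 7168, "cache2": 7168},
)

# Hypothetical three-host cluster holding about the same data: any host can be drained.
three_hosts = drainable_hosts(
    used_mb={"cache1": 3400, "cache2": 3400, "cache3": 3400},
    capacity_mb={"cache1": 7168, "cache2": 7168, "cache3": 7168},
)

print(two_hosts)
print(three_hosts)
```

With these example numbers, the same amount of cached data spread over three hosts leaves enough headroom to empty any one of them, which is another practical argument for the three-host recommendation of section 4.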

Another very important consideration:

Management of Distributed Cache Service Instances (AppFabric Cache Hosts) in SharePoint is different than management of most SharePoint service instances. … Unlike other service instances, though, the Distributed Cache Service Instance should either be installed *and* online on a SharePoint server, or not installed at all. If the service instance is stopped (disabled) but not uninstalled, details about the associated Cache Host stay in the Cache Cluster Config store, which can cause problems.

Josh Gavant, AppFabric Caching (and SharePoint): Configuration and Deployment (Part 2)

As far as administration is concerned, and although it is out of the scope of this post, I can’t finish without pointing you to two VERY good additional resources:

If we didn’t have this playbook, we’d have no good chance of creating the run book, or the scripts to implement the run book. Because we have it we can produce a run book and scripts much more easily as we have our essential details and we won’t waste time thrashing out hacks that semi work or have environment specifics hard wired into them.

And

where additional points are also well presented.

 

Let me know your findings on this topic, or about any SharePoint 2016 documentation that could have an impact on these points.

 

Hope this will help you to better design your SharePoint 2016 farms!


    5 thoughts on “[ #SharePoint2016 ] Planning and architecting Distributed cache Service”

    1. Pingback: Newsletter – Episode 62 – Belarus SharePoint Community Blog

    2. Hi Patrick,

      Regarding 5), as experience has shown, running the SP-Repopulate cmdlets has no effect.
      This was demonstrated to MS during a (large) project.

      Regarding 7), the SharePoint cmdlet Stop-SPDistributedCacheServiceInstance -Graceful should not be used.
      This was officially acknowledged in this blog:
      https://blogs.msdn.microsoft.com/sambetts/2015/03/12/graceful-sharepoint-appfabric-restarts/

      “This time we’ll run the graceful shutdown before rebooting, first this command (changing the hostname of course):
      Stop-CacheHost -HostName sp15-search-idx.sfb-testnet.local -CachePort 22233 -Graceful
      Now I bet that surprised you; in the official SharePoint/AppFabric documentation we’re told to run “Stop-SPDistributedCacheServiceInstance -Graceful”.
      However for reasons too complicated to go into here, let’s just say for now that the official stop command is far from graceful – the service is in fact dropped like a hot potato and anything on that host in AppFabric goes with it.”

      “I know some may be asking why the official guidelines on graceful shutdowns don’t work by default; let’s just say we’re looking at it.”

      Sam Betts explains that the subject is complicated and promises a post about it “another day”, but there is still nothing more than a year later…

      Say hello to Gilles for me.


    3. Number 7. still mentions to use the default SharePoint command for -graceful yet the article was updated with a script and uses Stop-CacheHost.
      https://technet.microsoft.com/en-us/library/jj219613.aspx#graceful
      Sam Betts mentions why the original one should not be used anymore:
      https://blogs.msdn.microsoft.com/sambetts/2015/03/12/graceful-sharepoint-appfabric-restarts/

      Some more info and a script to check the entire Cluster you can find here:
      https://blogs.technet.microsoft.com/filipbosmans/2015/12/07/troubleshooting-distributed-cache-for-sharepoint-2013-on-premise/
      Also contains links to best practices and explanation on what to do etc.

