Tuesday, January 12, 2021

Field Notes: SQL Server Replication


Field Notes - SQL Server Replication

This post is not a how to guide about replication. 

For this post I am trying something a little different. I've used replication in the past, recently I've spent a bit more time with it and I've got to be honest I'm now slightly more in favour of using it than I was before (I said slightly). This post is a collection of notes; things that I've tried, some of the problems that I have ran into and general thoughts. There's a lot of links to some really useful information out there (which is where you'll find the how-to stuff - such as the replication tutorials from Microsoft).

From the off, it goes without saying but I'll say it anyway; whatever you do in replication don't let the first time you try it out be in a production environment - it will end in disaster, I personally guarantee it!!

Accurate Replication Diagram

Replication is both extremely configurable and at times a seriously complex topic. Ideally you need a blank development canvas to setup replication systems on from scratch as well as a test system that resembles your production replication topology as closely as possible (and yes, that means clusters, AG's whatever).  I learnt very quickly that replication goes way beyond copying a table from A to B rather quickly and can often leave a system resembling the image above - so having the opportunity to try everything out in isolation a prior to production is a priceless benefit. Right, on with the notes:

Snapshot Initialisation Options

A problem came when we needed to rebuild the entire replication setup using the same databases. We couldn't reinitialise from a new snapshot so instead setup the subscription using the 'replication support only' option. When using this option the export and import of snapshot data is skipped and only replication support stored procedures and triggers are created at the subscriber before data synchronisation starts. 

The critical point when using this option is to ensure that the schema between publisher and subscriber match and that data is completely up to date in the subscriber as only changes from the start of synchronisation are replicated (see Subscription DML Errors further down for potential issues). Another option to reduce the impact of a large snapshot is to initialise with a database backup instead; here's a handy guide on how to do that.

Be aware though that non-sync subscription options such as replication support only also prevent the subscription from being reinitialised at any point; should the subscription expire for example then the subscription would need dropping and recreated.

Full Snapshots When Adding Tables - Immediate Sync Option

By default subscriptions are created with the immediate sync option equal to true, a downside to using this option occurs when adding a new table to a publication as it requires generation of the entire snapshot. Depending on database and publication size this can be an extremely long winded process. Setting the immediate sync option to false means a snapshot is only created for the new table when the snapshot agent is ran significantly reducing the size, resource cost of course and time for snapshot generation; more information here.

An alternative approach is to use multiple publications within the same published database, which I will cover shortly.

Trimming Down Publications

It's very easy to publish all tables to a subscriber, in fact it's quite common to see that approach but the obvious problem here is replicating way too much data than what you actually need. A trimmed down publication containing only what is required can provide significant performance improvements, particularly where snapshots are concerned (as well as being rather obvious best practice advice).

Similarly performing needless update operations on a published database can in turn cause those same needless operations to be replicated across to its subscribers.

Trimming down doesn't just apply to data though. Publications are also seriously configurable; objects such as the individual columns, foreign and unique key constraints, partitioning schemes, permissions and so much more can be replicated (or not) along with their respected tables and data. 

Both clustered and nonclustered indexes can be replicated to a subscriber (as well as spatial, full text and nonclustered columnstore). Indexes though can get very large and replicating them comes at a cost, a cost which if the index isn't used at the subscriber will be a total waste so consider disabling the replication of certain objects if they are not required. Here's an overview of article properties to get you started.

Publications can even take advantage of filters, essentially a where clause on the replicated object which can further minimise the footprint of a publication. This can be a really useful not just for performance reasons but also as a design consideration, particularly where multiple subscribers are present that could have their relevant data split out and filtered into separate publications, again the pros and cons to any design need weighing up. 

Remember it's much easier to carefully plan replication at the beginning of an implementation than it is to start making lots of changes once it is in place. Understanding (and sticking to) the exact requirements (and knowing the limitations) can often avoid problems later down the line.

Multiple Publications - Other Examples

It's common to split up publications into logical groups of objects, not just of object types. Again it is a design consideration; as an example, perhaps certain objects only need infrequent synchronisation so can be placed in a separate publication and scheduled accordingly. Large objects can also be put into their own publications 

Another configurable option within replication is command types; I have worked on systems where delete commands are not replicated to their archive subscriber databases. This works well but isn't without potential issues (as I'll mention soon). Whilst these command configurations can be made at the article layer it's another potential opportunity to split out a group of published data.

Just bear in mind that splitting out published data creates lots more to do; more publications means more agents, jobs, and data considerations - whilst it is possible it is advised not to include the same objects in multiple publications for performance reasons.

Distribution Performance - Indexing

Replication uses a lot of system tables to keep track of everything that is going on and the distribution database can get very busy, often the underlying system queries can be improved by implementing additional indexes within the distribution database. Here's a great post from Phil Grayson containing a script that I have implemented numerous times. I highly recommend treating the distribution database as you would any other and monitor for poor performance, particularly at the query layer.

Large Updates Causing Blocking (Merge Replication) - Generation Leveling Threshold

When you add large batch updates into the mix along with merge replication things can start to slowly grind to a halt with a common symptom being merge agents appearing to be in a hung state. 

In my experience I've seen issue show up with long blocking chains being present involving system tables such as MSmerge_tombstone and MSmerge_genhistory. One option is to alter the publications generation_leveling_threshold setting which alters the group size of replicated transactions and greatly improve the performance of merge synchronization; here's a link with more informationAlso worth noting, the website linked to is bloody marvelous for replication guidance - bookmark it.

Flexibility Over Availability Groups

This one is a little bit of a shocker for me to say; in some ways I actually prefer replication over Availability Groups! AG's can be a pain to manage but one of my biggest problems with them is the lack of flexibility when it comes to readable replicas, because basically there isn't any. 

Using transactional replication means I can take of advantage of being able to create and use additional objects at the subscriber database (which I cannot do in AG's). Subscriber workloads tend to be different to the workloads on the publisher, different workloads mean different queries and that means I can do things like use completely different indexing strategies at the subscriber (even better if I've already trimmed down the excess).

Flexibility in the subscriber database is great, but there are some things to think about...

Subscriber Performance Considerations

I mentioned differing workloads and a typical example setup of a transactional publisher replicating to a subscriber used for reporting purposes. Subscriber performance can often be overlooked but is equally critical to publisher and distribution performance because it can have a direct affect on replication. 

This is a really important consideration for me; treat a replication topology as a whole system, not individual parts. They can all impact one other, and often do, so should be monitored and subsequently tuned accordingly.

A common scenario that I have seen is larger reporting queries (heavy selects) being ran on the subscriber blocking replication update commands. These are reported in replication monitor as query timeouts (no activity for 10 minutes) but often this error indicates issues at the subscriber. Replication is then queued up until the block resolves, for significant duration this then causes a delay whilst replication catches up. 

Kendra Little wrote a fantastic replication tuning checklist covering lots of performance considerations. One of the mentions was about using RCSI at the subscriber database; I can't say at the time of writing that I have used in production yet but is definitely something I will be exploring on new implementations.

Subscriber DML Errors - Row was not found at the subscriber

Earlier I mentioned the replication support only subscription option which relies on the data already being in sync between publisher and subscriber. If the data is not in sync, for example down to publisher database activity not being stopped then we can run into replication DML errors such as attempting an update on a row that doesn't exist (which can also happen if rows are deleted from the subscriber database directly, so remember to apply sensible permissions).

At first these errors can be difficult to diagnose because the issue isn't obvious in replication monitor. It will inform us that an error has occurred with an error message such as 'The row was not found at the subscriber when applying the replicated command' along with a transaction sequence number (the really important bit), so you'll need to get familiar with...

sp_browsereplcmds

In order to find the actual error we need to run sp_browsereplcmds using the transaction sequence number for the @xact_seqno_start and @xact_seqno_end parameters (here's a working example) to retrieve the command in error. This command is definitely one to get familiar with.

MSrepl_errors

In the above example the sequence number is visible in replication monitor but sometimes certain errors do not expose the sequence number in the visible error text. In this case there is system table in the distribution database named MSrepl_errors which can be sorted by the time column to find the latest errors. There can be a lot of noise in this table but by using the time and error_text details you can normally locate the problem and then use the corresponding xact_seqno value with sp_browsereplcmds as you would in the example above.

Replication in Availability Groups

If you're using SQL Server 2016 or greater then you can place the distribution database in an Availability Group and then when setting up publishers use the appropriate AG listener. However, this works fine UNLESS you're using merge, bi-directional or peer to peer replication, then it won't work fine...well it won't work at all!

Publishers can also be setup in Availability Groups although rather than set them up directly within an Availability Group they must be setup on the primary replica first and then create a publisher redirect on the distribution database pointing to the Availability Group listener. During failover from the primary replica replication will then be rerouted to the listener keeping it available.

Subscribers to transactional replication can also be setup within Availability Groups. When setting up the subscription ensure that the listener name is used rather than the actual server name. There is a difference though when using push or pull subscriptions; push requires no additional action but when using pull subscriptions you must manually create the distribution agent job on secondary replicas and for future failovers handle enabling and disabling the jobs on the old and new replicas.

Replication in Azure Managed Instances

Worth a mention; replication is supported by Azure Managed Instances. You can't use merge replication, or peer-to-peer or updateable subscriptions but for transactional and snapshot replication it's supported. 

I did have some fun and games setting it up. The initial issues were down to Azure port permissions on the file share that needs to be used for the snapshot folder (Azure MI was distributor and publisher). I also found any modifications to be much smoother when using T-SQL rather than SSMS.

Another issue was a problem encountered when setting up a new publication on a database restored to the managed instance from an older version of SQL Server which was failing due to a 'column does not allow nulls. INSERT fails' error on a system table (IHPublications).

In this case the database was already being used in replication so when restored I suspect the system tables came along with it. On the original SQL Server instance the column did not allow NULL values which add publication process on the managed instances was trying to add. Following the guidance in this article all system objects were dropped and then the publication was successfully created.

Push and Pull Subscriptions - Reading the Documentation!

Primarily the difference between push and pull subscriptions is where the distribution agent resides; for pull subscriptions the job runs on the subscriber and for push it runs on the distributor. In larger scale setups using a higher number of push subscriptions performance issues can occur on the distribution server (which can then cause issues on the entire replication system) so when using a large number of subscribers pull subscriptions should be considered. There are further considerations when replicating over a WAN; here's a useful article which also contains a handy tip on manually copying a snapshot to a subscriber.

As mentioned previously in the Availability Group section there are additional considerations when using push and pull subscriptions. Often there can be subtle differences between the types when initially setting them up, making modifications to them or even when removing. In all Microsoft provides pretty good documentation when it comes to the various replication activities.

SQL Server Updates & Version Compatibility

According to Microsoft SQL Server fully supports upgrading databases used in a replication topology, however there are some rules you have to follow:

  • Distributor version must be greater than or equal to the Publisher
  • Publisher must be equal to or less than the Distributor
  • Transactional subscribers can be within two versions of the Publisher
  • Merge subscribers can be all versions equal to or below the Publisher
However; I've found that in some circumstances whilst setting up subscribers in Management Studio fails due to version incompatibility I have been able to set them up without any issue when using T-SQL - I assume it's not supported so still definitely not recommended

Patching

Upgrading SQL Server instances in a replication topology does require planning and a without doubt a rehearsal (or a few) on a test system, including a rollback plan. As well as compatibility considerations there are other steps relating to replication that require adding to the outage plan such as stopping user activity and running replication agent jobs. This guide from Microsoft covers the necessary steps. It is recommended to perform updates to the distributor first, then publishers followed by subscribers (although for in-place upgrades the publisher and subscribers can be in any order I prefer to keep the process consistent). 

Republishing Data

Replication can also be setup in what is known as a republishing topology; this basically means that a subscriber can also act as publisher  (the data it receives in a subscription is also published to another subscriber).

There are a things you can and cannot do and supported scenarios; here's the official Microsoft documentation on the topic. I found this a really useful option when migrating a very large and busy replication topology to a new system with zero downtime. 

Removing Replication

Finally, one major thing I have learned is that whilst replication can be fairly straightforward to setup it can be a real pain to remove. A classic example when trying to remove a publication is to receive a primary key violation error and then when trying to remove said primary key (or it's related table) you then receive a cannot remove as it's used in replication error, it's a brilliant catch-22 scenario.

In this case publication often needs disabling at the database level by using sp_replicationdboption setting @optname = N'publish' to false and then proceeding to drop the publication. This command though is at the database level - not just a single publication

If all else fails and replication is still hanging back there is a fall back option, sp_removedbreplication which when ran on the publisher or subscriber database will remove all related replication objects. Just remember to manually remove any related SQL Agent jobs from the distributor afterwards.

For now that is all, I'm sure I'll be making edits, updates and additions to this post in the future but if you are working with replication hopefully it's been of some use and pointed you in some useful directions. Replication is a complex technology but I'd also argue it's incredibly robust and flexible in some ways too. Until next time!

Breaking up with SQL Server

I was inspired to write this after reading a post from Dave Mason regarding breaking up with Big Tech companies. Yet again I haven't wr...