What a week. It actually started over the weekend, with a DFSR database crash on a spoke server in my topology. The affected replication group is 1.6 TB in size, so the volume check takes quite a long time.
During this time replication hung, so I decided to restart the DFSR service on our hub server. Unfortunately, the restart failed and the service hung at “stopping”. So I killed the dfsrs.exe process and then started the service.
At this point, it tried to repair the DFSR database but failed, so it went into “initial replication”. Initial replication on a 1.6 TB replication group is a thing straight from my nightmares. Compounding the problem, the hub server crashed the next night (which I haven’t had time to look into yet), so the whole process basically had to start over.
That was 3 days ago. After all this time, initial replication has finished, but there is a backlog of 10,000 files going to 2 of the spoke servers. That backlog didn’t appear to be moving, and the DFS Replication section of the Event Log revealed:
The DFS Replication service encountered an error communicating with partner SW3020 for replication group swg.ca\files\jobs. The service will retry the connection periodically. Additional Information: Error: 9032 (The connection is shutting down)
The steps I took to fix this error:
- On the hub server, I deleted the individual connections from the hub to the spoke servers for this specific replication group
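- From a domain controller in the hub site, I ran this to push those changes out to the branch sites sooner (/e syncs across site boundaries, /A covers all naming contexts, /P pushes changes outward from the DC it runs on):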
repadmin /syncall /e /A /P
- Then I re-created the connections for each spoke and re-ran the repadmin command.
Following that, both servers showed this in the DFSR log:
The DFS Replication service failed to communicate with partner SW3020 for replication group swg.ca\files\jobs. The partner did not recognize the connection or the replication group configuration. The service will retry the connection periodically. Additional Information: Error: 9026 (The connection is invalid)
So from each spoke server, I ran the following:
dfsrdiag pollad /v /member:hubserver.domain.com (Replication partner)
dfsrdiag pollad /v /member:hub_site_dc.domain.com (Domain Controller in hub site)
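With more than a couple of members to poll, I would probably script this rather than type it each time. Below is a minimal Python sketch, not something I ran during the outage. The function names are made up for illustration, the host names are the same placeholders as above, and it assumes dfsrdiag is on the PATH of the machine running it. It defaults to a dry run that only prints the commands.

```python
import subprocess


def pollad_command(member):
    """Build the same dfsrdiag call shown above for one member (hypothetical helper)."""
    return ["dfsrdiag", "pollad", "/v", f"/member:{member}"]


def poll_members(members, dry_run=True):
    """Print each command; execute it only when dry_run is False.

    Actually running the commands assumes a Windows host with dfsrdiag installed.
    """
    commands = [pollad_command(member) for member in members]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if dfsrdiag exits non-zero.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Placeholder host names, taken from the commands above.
    poll_members(["hubserver.domain.com", "hub_site_dc.domain.com"])
```

Calling poll_members with dry_run=False runs the commands for real, one member at a time, and stops at the first one that fails.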
Shortly thereafter I saw this in the logs:
The DFS Replication service successfully established an inbound connection with partner SW3020 for replication group swg.ca\files\jobs.
And now replication traffic is flowing properly. Now all I have to do is deal with the more than 500 conflict files this whole ordeal has generated.
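If you end up sifting through a lot of these events, a small script can pull the partner, replication group, and error code out of the message text. This is only a sketch: the function name and the regular expression are my own, and the pattern is modelled purely on the two error messages quoted in this post, so it may well miss other DFSR events.

```python
import re

# Hypothetical pattern, based only on the two error events quoted in this post.
EVENT_PATTERN = re.compile(
    r"partner (?P<partner>\S+) for replication group (?P<group>.+?)\. "
    r".*?Error: (?P<code>\d+) \((?P<reason>[^)]*)\)",
    re.DOTALL,
)


def parse_dfsr_event(message):
    """Return partner, group, code and reason as a dict, or None if no error is found."""
    match = EVENT_PATTERN.search(message)
    if match is None:
        return None
    return {
        "partner": match.group("partner"),
        "group": match.group("group"),
        "code": int(match.group("code")),
        "reason": match.group("reason"),
    }
```

Messages without an "Error: NNNN (...)" suffix, such as the "successfully established an inbound connection" event, come back as None, which makes it easy to filter a dump of events down to just the failures.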