What a week. It actually started over the weekend, with a DFSR database crash on a spoke server in my topology. The crash hit a replication group that is 1.6 TB in size, so the volume check takes quite a long time.
During this time, replication hung, so I decided to restart the DFSR service on our hub server. Unfortunately, the restart failed, and the service hung at “stopping”. So I killed the dfsrs.exe process and then started the service.
At this point, it tried to repair the DFSR database but failed, so it went into “initial replication”. Initial replication on a 1.6 TB replication group is a thing straight from my nightmares. Compounding the problem, the hub server then crashed the next night (which I haven’t had time to look into yet), and I basically had to restart the process.
That was 3 days ago, and after all this time, I’ve got initial replication finished but a backlog of 10,000 files going to 2 of the spoke servers. That backlog didn’t appear to be moving, and investigating the DFS Replication section of the Event Log revealed:
The DFS Replication service encountered an error communicating with partner SW3020 for replication group swg.ca\files\jobs. The service will retry the connection periodically. Additional Information: Error: 9032 (The connection is shutting down)
The steps I took to fix this error:
- On the hub server, I deleted the individual connections from the hub to the spoke servers for this specific replication group
- From a domain controller in the hub site, I ran this to ensure those changes reached the branch sites sooner:
repadmin /syncall /e /A /P
- Then I re-created the connections for each spoke and re-ran the repadmin command.
Following that, both servers showed this in the DFSR log:
The DFS Replication service failed to communicate with partner SW3020 for replication group swg.ca\files\jobs. The partner did not recognize the connection or the replication group configuration. The service will retry the connection periodically. Additional Information: Error: 9026 (The connection is invalid)
So from each spoke server, I ran the following:
dfsrdiag pollad /v /member:hubserver.domain.com (Replication partner)
dfsrdiag pollad /v /member:hub_site_dc.domain.com (Domain Controller in hub site)
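If you have several spokes to fix, the two commands above can be generated rather than typed by hand. This is only a rough sketch: the helper names `pollad_commands` and `run_all` are made up for illustration, the hostnames are the same placeholders used above, and it defaults to a dry run that just prints the commands.

```python
import subprocess

def pollad_commands(members):
    """Build one 'dfsrdiag pollad /v /member:<name>' command per member name."""
    return [["dfsrdiag", "pollad", "/v", f"/member:{m}"] for m in members]

def run_all(commands, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # Raises CalledProcessError if dfsrdiag returns a non-zero exit code.
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Placeholder names from the post: the replication partner (hub)
    # and a domain controller in the hub site.
    members = ["hubserver.domain.com", "hub_site_dc.domain.com"]
    run_all(pollad_commands(members), dry_run=True)
```

Setting `dry_run=False` would actually invoke dfsrdiag, so that only makes sense on a machine where the DFS management tools are installed.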
Shortly thereafter I saw this in the logs:
The DFS Replication service successfully established an inbound connection with partner SW3020 for replication group swg.ca\files\jobs.
And now replication traffic is flowing properly. Now all I have to do is deal with the more than 500 conflict files this whole ordeal has generated.
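If you are watching for these events across several servers, it can help to pull the partner name and error code out of the message text. Here is a minimal sketch that does this for the messages quoted above. The function name `parse_dfsr_event` and the regular expressions are my own assumptions based only on the wording of those three messages, so other DFSR events may not match.

```python
import re

# Matches the "Error: 9032 (The connection is shutting down)" tail.
ERROR_RE = re.compile(r"Error:\s*(\d+)\s*\(([^)]*)\)")
# Matches "partner SW3020 for replication group swg.ca\files\jobs."
PARTNER_RE = re.compile(r"partner\s+(\S+)\s+for replication group\s+(.+?)\.(?:\s|$)")

def parse_dfsr_event(message):
    """Return partner, group, error code and reason (None where not present)."""
    partner = group = code = reason = None
    m = PARTNER_RE.search(message)
    if m:
        partner, group = m.group(1), m.group(2)
    e = ERROR_RE.search(message)
    if e:
        code, reason = int(e.group(1)), e.group(2)
    return {"partner": partner, "group": group, "code": code, "reason": reason}

if __name__ == "__main__":
    msg = (
        r"The DFS Replication service encountered an error communicating with "
        r"partner SW3020 for replication group swg.ca\files\jobs. The service "
        r"will retry the connection periodically. Additional Information: "
        r"Error: 9032 (The connection is shutting down)"
    )
    print(parse_dfsr_event(msg))
```

The success message carries no error code, so `code` and `reason` come back as `None` for it.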
worked a treat, thanks
DFS is not an enterprise solution. Get a decent SAN and get on with more important things in your life.
Pretty dumb statement
Just wanted to say thanks for this. I was about to take a sledgehammer to a server in the off chance that it might knock some sense into it.
Thank you so much!!! For me it works again.
Cheers
Olaf
Please let me add my thanks. I jumped right to the dfsrdiag pollad… and saw a successful inbound connection event almost immediately.
Perfect, thank you 🙂
Amazing, works perfectly, thanks!
I had this problem too. You have helped me, thank you!
Nice article.
What do you mean when you say
” I deleted the individual connections from the hub to the spoke servers for this specific replication group”
Can you explain how you did that?
Sorry, it’s been quite a few years (almost 10!) since I wrote this post, and I haven’t had the opportunity to work with DFSR in quite some time.
My recollection is that within the DFSR mmc control panel, I removed each connection within the replication group. I can’t remember any more specific details though.