Locked files on SMB share

I’d been experiencing an issue with files becoming locked as read-only on a Server 2012 R2 file share, typically across the WAN.

At least once a day, a user would report that a file on the file server had unexpectedly been marked read-only.

Here are some of the instances that occurred:

  • A person works on a drawing for an hour or two, attempts to save the drawing they’ve had open the whole time, and receives “file is read only”
  • A person goes to open a drawing and gets a warning that it’s read-only, but the user who previously had it open closed it minutes or hours ago.
  • A person goes to open a drawing and gets a warning that it’s read-only, but the user mentioned in the warning has not touched the drawing since their last restart (perhaps “recent files” holding it open?)

95% of these issues were related to AutoCAD .dwg files, but it occasionally happened to Excel files too.

I used handle.exe from Sysinternals to verify that the file was actually held open by a process (acad.exe), and it consistently was; there was just no explanation for why or how the process was holding the file handle open without the user’s interaction or knowledge.

I finally traced this to a series of registry changes that were being pushed out as ‘optimizations’ for SMB, which had been recommended here: https://msdn.microsoft.com/en-us/library/dn567661%28v=vs.85%29.aspx#clients

Primarily, we had defined:

HKLM\System\CurrentControlSet\Services\LanmanWorkstation\Parameters

Key                          New Value  Original Value
FileInfoCacheEntriesMax      32768      64
DirectoryCacheEntriesMax     4096       16
FileNotFoundCacheEntriesMax  32768      128

When I reverted these values back to original, the reported issues universally stopped according to my users.
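For anyone wanting to check a client for the same overrides, here is a minimal Python sketch. It is not what I ran at the time; it simply compares the three values against the original values listed above. The names find_overrides and read_current_values are made up for this example, and the sketch assumes that a value missing from the registry means Windows falls back to its built-in default.

```python
import sys

# Original (default) values as listed in this post.
DEFAULTS = {
    "FileInfoCacheEntriesMax": 64,
    "DirectoryCacheEntriesMax": 16,
    "FileNotFoundCacheEntriesMax": 128,
}

PARAMS_KEY = r"System\CurrentControlSet\Services\LanmanWorkstation\Parameters"


def find_overrides(current, defaults=DEFAULTS):
    """Return {name: (current_value, default_value)} for values that differ.

    A name missing from `current` is treated as "not set" and is not
    reported (assumption: Windows then uses its built-in default).
    """
    return {
        name: (current[name], default)
        for name, default in defaults.items()
        if name in current and current[name] != default
    }


def read_current_values(names=DEFAULTS):
    """Read the values from HKLM. Works on Windows only."""
    import winreg  # standard library module, present only on Windows

    found = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, PARAMS_KEY) as key:
        for name in names:
            try:
                value, _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                continue  # value not present in the registry
            found[name] = value
    return found


if __name__ == "__main__" and sys.platform == "win32":
    for name, (now, default) in find_overrides(read_current_values()).items():
        print(f"{name}: {now} (original value {default})")
```

Run on an affected client, this should list any of the three values that still differ from the originals; an empty result suggests the revert has been applied.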

This just speaks to the need for better change tracking in my organization; it would have been relatively simple to correlate the first reported instance with a set of changes made in the same time frame. Implementing that system is easier said than done, however.

Network up but DNS mysteriously broken

I was recently troubleshooting a computer for a family member, where they reported “I can’t access the Internet” and the resolution was something I’ve never seen before.

This was a laptop with both an Ethernet and a Wi-Fi connection. Both were set to DHCP with automatically assigned DNS servers, and ipconfig displayed the correct information.

I could ping 8.8.8.8, confirming network connectivity, and nslookup found my gateway acting as a DNS server, which could properly resolve external names.

However, as soon as any browser attempted to access a DNS name, it failed. Chrome gave a “DNS_PROBE_FINISHED_NXDOMAIN” error, and IE simply stated “Page could not be found”.

I checked the Hosts file for malicious entries, ensured no proxy was enabled within IE, and verified the routing table was all normal.

I ran ComboFix and GMER to look for rootkits, and started the computer in Safe Mode with Networking, but none of these resolved the issue.

Finally I decided to install Wireshark and run Process Monitor while the browser connection was made, in an attempt to see where these requests were going.

When I tried to run Wireshark after the install, though, it gave an error about a missing “dnsapi.dll” file. I verified the file was in the proper location (c:\windows\system32), but on a hunch decided to refresh it with SFC using this command:

sfc /scanfile=c:\windows\system32\dnsapi.dll

The output confirmed a corrupted file was replaced, and then I rebooted Windows. Once it came back up, all external browsing worked!

I suspect that some malware had gotten onto this machine and modified the dnsapi.dll file, and had at some point been only partly removed.

This one left me confused for a while, so hopefully this helps anyone else coming across the issue.
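The pattern in this post (ping works, nslookup works, browsers fail) can be narrowed down with a quick check of the operating system's own resolver. Here is a minimal Python sketch, assuming Python is available on the machine. The function names are made up for this example. As far as I understand it, nslookup sends its own queries straight to the DNS server, while browsers and most other applications go through the OS resolver, which would be consistent with nslookup succeeding while everything else failed.

```python
import socket


def ip_reachable(address="8.8.8.8", port=53, timeout=3.0):
    """Check raw IP connectivity without involving any name lookup."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def os_resolver_works(hostname="example.com"):
    """Resolve a name through the OS resolver, as most applications do."""
    try:
        socket.getaddrinfo(hostname, 80)
        return True
    except socket.gaierror:
        return False


def classify(ip_ok, os_resolver_ok):
    """Turn the two probe results into a rough diagnosis label."""
    if not ip_ok:
        return "no-connectivity"   # look at cabling, Wi-Fi, DHCP, routing
    if not os_resolver_ok:
        return "resolver-broken"   # network is fine; suspect the local DNS stack
    return "ok"


if __name__ == "__main__":
    print(classify(ip_reachable(), os_resolver_works()))
```

A result of "resolver-broken" while nslookup still resolves names would point at the local DNS client components rather than at the network or the DNS server, which is where this problem turned out to be.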

Hyper-V NIC Team Networking issue

I recently encountered an issue with a Hyper-V virtual machine that had me very confused. I still don’t have a great resolution for it, but I do at least have a functional workaround.

At a site I have a Server 2012 R2 Hyper-V host (Host), a Server 2012 R2 file server guest (FileServer) and a Server 2012 R2 domain controller (DC).

The Host has two NICs in a single switch-independent address hash team. This Team is set as the source of the vSwitch which has management capabilities enabled. We wanted to use a NIC team to provide network redundancy in case a cable was disconnected as this network is being provided by the site owner rather than my company.

Host and FileServer could reach DC, but nothing else could. DC could reach Host and FileServer, but not even its own default gateway.

This immediately sounded like a misconfigured virtual switch on Host, as it appeared DC could only access internal traffic. But I confirmed the vSwitch was set to “External”, and if the switch were misconfigured, FileServer would presumably have been affected as well, which it was not.

I tried disabling VMQ on the Host NICs, as well as Large Send Offload, since both those features have been known to cause problems, but that did not resolve the issue either.

I tried changing the teaming algorithm to Hyper-V Port and Dynamic, but that didn’t resolve the issue either.

Then I decided to put one of the NICs into a Standby state in the team. This caused the accessibility to switch between the VMs; all of a sudden nothing external could reach my file server, but the DC came online to external traffic.

I tried changing which NIC was in standby but that still left me with one VM that had no connectivity.

My assumption at this point is that this issue is being caused by Port Security on the network switches, which we know the site owner uses. I suppose that the Team presents a single MAC address across multiple ports, which the port security doesn’t like, and so it blocks traffic from one side of that team. Because of how traffic is balanced across the team, it leaves one of the VMs in an inaccessible state.
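As a rough illustration of that assumption, here is a toy Python model of a switch that pins each source MAC address to the first port it was seen on. It is only a sketch: the MAC addresses are invented, the class and function names are made up for this example, and it does not prove that this is what the site owner’s switches were actually doing.

```python
class StickyMacSwitch:
    """Toy switch: each source MAC is pinned to the first port it appears on."""

    def __init__(self):
        self.mac_to_port = {}

    def receive(self, port, src_mac):
        """Return True if the frame is accepted, False if it is dropped."""
        owner = self.mac_to_port.setdefault(src_mac, port)
        return owner == port


# Hypothetical MAC addresses, for illustration only.
VMS = {
    "FileServer": "00:15:5d:00:00:01",
    "DC": "00:15:5d:00:00:02",
}


def reachable_with_only(switch, active_port):
    """Which VMs get frames through if the team uses only `active_port`?"""
    return {name: switch.receive(active_port, mac) for name, mac in VMS.items()}


switch = StickyMacSwitch()
switch.receive(1, VMS["FileServer"])  # FileServer first seen on switch port 1
switch.receive(2, VMS["DC"])          # DC first seen on switch port 2

print(reachable_with_only(switch, 1))  # {'FileServer': True, 'DC': False}
print(reachable_with_only(switch, 2))  # {'FileServer': False, 'DC': True}
```

In this model, whichever NIC is left active decides which VM gets through, which is at least consistent with the connectivity flipping between the two VMs when one team member was put into standby.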

Unfortunately we do not have control over this network or the ability to implement LACP, and so I’ve had to remove the NIC teaming and go back to segregated NICs for management and VM access.

Excel slow to open on Windows 10

I’ve been having an issue with Excel 2013 (from the Office 365 Click-to-Run installer) for a while now, and I finally decided to dig a little deeper into it.

I found that when I double-clicked a file in Windows Explorer to open it in Excel, it would take 15-30 seconds before the application appeared.

However, if I opened Excel from the start menu, or through a “Run” command, it would appear instantly.

I tried many different things to isolate this issue, such as checking for conflicting processes, watching Process Monitor, and eliminating the network as a source of problems.

Eventually I hit the right combination of Google keywords and came across this post.

Based on that recommendation, I disabled Cortana and immediately saw improved response times from Excel. Hopefully Microsoft fixes this bug in time for Office 365 integration with Cortana!

File Server and remote office collaboration

I had someone email me about DFSR and file locking, based on some old comments on a Ned Pyle post from the Ask DS blog on TechNet. I wrote up a detailed response describing how I’ve handled this issue in my environment, and decided that response would make a good post to share, so here it is:

It’s a long story actually. Not sure if you read my post from April 2013, but I effectively gave up on DFSR due to the instability we were experiencing. Still not sure if it was underlying storage causing it, or just the scale at which we were operating. PeerSync couldn’t keep up with the rate of change on our file servers when we piloted it, and neither could PeerLock when we tried to integrate it with DFSR.

So stuff crashed in April 2013, I gave up on DFSR, and we intended to use Silver Peak WAN acceleration across our site-to-site VPNs, combined with RDS or Citrix XenApp.

Then my company was acquired by a much bigger engineering firm, and all my plans were interrupted. We basically left our users to suffer (since we had already used DFSR to create a single namespace, when we quit DFSR we just kept a ‘single source of truth’) for months because the parent company was going to roll out a global WAN.

That did eventually happen in late 2014, so at that point we had Riverbed Steelheads providing 80% data reduction across the WAN and a 7 ms link between our biggest branch and head office (where the most pain was for ACAD production).

We occasionally heard rumblings from staff that it wasn’t good enough, so we pressed onwards with the purchase of a 30-seat Citrix XenApp environment with a dedicated NVIDIA GRID K2 card for GPU acceleration. However, we really screwed up the user communication and training, and failed to get adequate buy-in from CAD management to enforce its use.

XenApp works fantastically for apps like ArcGIS Desktop and GlobalMapper, but we really struggled with mouse lag in all AutoCAD streams (except Revit, which we don’t use). We went as far as bringing in a specialized consultant out of the UK, and they couldn’t figure it out either.

I had begun the process of looking at Panzura, which I raved about here. It still looks like the best overall solution for a centralized file server with geographic collaboration on AutoCAD files. But the price tag for two offices was close to $60k, not counting cloud storage costs; I had just purchased a whole bunch of new storage, so I didn’t want new Panzura hardware; and they don’t support Hyper-V for their virtual instance (which still blows my mind).

Ultimately, a couple of months ago we had a big meeting with all our CAD managers, and the consensus was that:
– The majority of work performed in branch offices is acceptable due to the Riverbed Steelheads and low latency provided by the global WAN
– The work that doesn’t perform well due to huge file sizes would be done with XenApp and the mouse lag would just be accepted. But I don’t think the CAD managers are actively encouraging this.

And that’s where we’re at right now; not really solving the issue but trying to find improvements as best we can.