Wednesday, January 12, 2011

DNS trouble with Server 2008 R2

We are having an odd DNS problem and I’ve been receiving a lot of good help from many wonderful folks on Twitter. However, we are trying to solve this in 140 characters and that is proving difficult, so here is the whole story. If this fails, the next step is to contact the friendly folks at Microsoft for only $500.

We run Active Directory Integrated DNS on Windows Server 2003. We have a single DNS server that works fine, but we are migrating from our physical Server 2003 box to our virtual farm. Our virtual servers are running Server 2008 R2. The plan was, and hopefully still is, to add a Server 2008 R2 Domain Controller with DNS, allow it to replicate the DNS via Active Directory, and then point all the clients to the new DNS server. After sufficient time for that to take place, we would remove our old 2003 DNS server and run only on the new 2008 R2 server.

I ran dcpromo on the 2008 R2 server, which added DNS, and the replication started. Everything went just as numerous articles said it would. The problem is that the new DNS server will not reliably resolve external domain names. It works fine internally but not when going upstream to our external DNS server, which is provided by our local fiber internet service.

When you run an nslookup on the new 2008 R2 DNS server you get a time-out the first try. You can then run it a second time at which point it works and will continue to work until you try a different domain. So if you nslookup yahoo.com the first response will be a time-out. The second and each subsequent time it will work fine until you switch and try google.com. Then that one will work but if you go back and try yahoo.com again it will time-out the first time.
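
The pattern above can also be reproduced without nslookup. Here is a minimal Python sketch, assuming the new server answers on 10.0.0.5 (the address shown in the nslookup output in the comments) and using a 2-second timeout to match the time-out nslookup reports. The names `build_query`, `timed_lookup`, and `probe` are illustrative only and are not part of any tool.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record (no EDNS)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ending in a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def timed_lookup(name, server, timeout=2.0):
    """Send one UDP query to `server`; return (answered, elapsed_seconds)."""
    start = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            sock.recvfrom(4096)
            answered = True
        except OSError:
            # Covers time-outs and other socket errors alike.
            answered = False
    return answered, time.monotonic() - start


def probe(names, server, attempts=2, lookup=timed_lookup):
    """Query each name `attempts` times in a row, in the order given."""
    return [
        (name, [lookup(name, server) for _ in range(attempts)])
        for name in names
    ]
```

Calling `probe(["yahoo.com", "google.com", "yahoo.com"], "10.0.0.5")` follows the same sequence described above: on an affected network the first attempt for a name would be expected to fail after about two seconds while the second returns quickly.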

The AD integration part is working, as both the old 2003 DNS server and the new 2008 R2 DNS server are keeping each other up-to-date. However, for reasons we haven’t uncovered yet, the 2008 R2 server is not reliably resolving external names.

We have confirmed that our firewall is EDNS and DNSSEC compatible and passes the larger UDP packets. We have also tried disabling EDNS on the server but it has no effect. Whether on or off the time-out still occurs.
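
For reference, the "larger UDP packets" come from EDNS: a query carries an extra OPT record advertising that the sender accepts replies bigger than the classic 512-byte limit. The sketch below builds such a query by hand so the bytes can be sent through a firewall and the reply size inspected. It is only an illustration under assumptions: the 4096-byte advertised size and the function names are my own choices, not settings taken from our servers.

```python
import struct

# Largest DNS message allowed over UDP when EDNS is not in use.
CLASSIC_UDP_LIMIT = 512


def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    return b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"


def build_edns_query(name, udp_size=4096, dnssec_ok=True, query_id=0x1234):
    """Build an A-record query with an EDNS0 OPT record attached."""
    # Header: recursion desired, 1 question, 1 additional record (the OPT).
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", 1, 1)  # A, IN
    # OPT record: root name, type 41, CLASS field reused as the UDP size,
    # TTL field reused for flags (0x8000 = DNSSEC OK bit), empty RDATA.
    flags = 0x00008000 if dnssec_ok else 0
    opt = b"\x00" + struct.pack("!HHIH", 41, udp_size, flags, 0)
    return header + question + opt


def exceeds_classic_limit(response):
    """True if a reply would not fit in a pre-EDNS 512-byte UDP message."""
    return len(response) > CLASSIC_UDP_LIMIT
```

Sending the output of `build_edns_query` to the upstream server over UDP from both sides of the firewall, and checking the replies with `exceeds_classic_limit`, is one way to see whether oversized answers actually make it back.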

Seems we are missing something obvious but what?

23 comments:

  1. What DNS servers do you have listed under your NIC? First 127.0.0.1 and second an IP of another server? Vice versa?

    When you drop to command prompt... type in nslookup... do you get an immediate failure of any kind, or does everything appear normal?

    After you type in nslookup, and the applet launches, type in the word "server" (no quotes) and hit enter... what do you see returned?

    What it "smells like" is that your first DNS entry isn't working as expected - maybe service failure, maybe upstream DNS forwarder, maybe the presence of the "." root (which doesn't need to be there), or something else.

    Can you get us more details about your actual NIC/DNS Servers setup, and some of the output from NSLookup?

  2. Under the NIC the only DNS server I have listed is the server itself.

    I don't get any failures when I start nslookup. I get the FQDN for the default server and the proper address.

    When I type in server I get the FQDN of the server, the address and then I get:
    *** dc1.faith.lafayette can't find server: Non-existent domain

    The upstream DNS server is working as my 2003 DNS server is using it and it is working fine.

    Here is an NSLookup output:
    > google.com
    Server: dc1.faith.lafayett
    Address: 10.0.0.5

    Non-authoritative answer:
    DNS request timed out.
    timeout was 2 seconds.
    Name: google.com
    Addresses: 209.85.225.106
    209.85.225.147
    209.85.225.99
    209.85.225.103
    209.85.225.104
    209.85.225.105

    or sometimes I just get

    > cnn.com
    Server: dc1.faith.lafayette
    Address: 10.0.0.5

    DNS request timed out.
    timeout was 2 seconds.
    DNS request timed out.
    timeout was 2 seconds.
    *** Request to dc1.faith.lafayette timed-out

    Hope this helps.

  3. And you have reverse PTR records in place?

    nslookup

    type in "10.0.0.5" and what do you get as a response?

    type in the IP of another server - do you get an error, or the FQDN as expected?

    it still seems to me that your 2008 R2 servers are doing "root hints" or you have the "." root zone in your DNS server.

  4. lastly, just on a hunch, are you doing any firewall ACL filtering on port 53? Is your firewall only expecting DNS resolution on your old/2003 DNS servers, and you forgot to open up/adjust the firewall for the new IP addresses of the 2008 R2 servers?

  5. The reverse PTR record is in place. Here is what I get when I do 10.0.0.5 in nslookup.

    > 10.0.0.5
    Server: dc1.faith.lafayette
    Address: 10.0.0.5

    Name: dc1.faith.lafayette
    Address: 10.0.0.5

    Here is what I get when I do another server.

    > 10.0.0.7
    Server: dc1.faith.lafayette
    Address: 10.0.0.5

    Name: sql01_server.faith.lafayette
    Address: 10.0.0.7

    I have confirmed I am not doing root hints and that the "." zone is not in the DNS server. Unless I'm not looking in the right place, but it seems pretty obvious.

    I have also confirmed that the SonicWall is passing the IP address of the new DNS server.

  6. so - humor me

    change the DNS settings on the NIC

    primary 127.0.0.1
    secondary IP.OF.ANOTHER.2008R2

    what does that do for you? any difference?

    and you don't see anything in eventvwr.msc / DNS Server?

  7. No change. No errors in event viewer either.

  8. Sorry dude. Without looking at it, I'm not sure. I asked our engineers and 100% of them said "root hints and forwarders to fix" - but if you've verified you aren't using root hints, set up forwarders, and restarted the appropriate service, then I'm not sure right now.

    Bummer.

  9. Thanks for the help. I think we may have an issue with our VMware setup. Either the template that we are deploying our VMs with is corrupt somehow or there is an issue at the VM layer. To test this we are deploying a physical Server 2008 R2 box to see if it works there. If it does, then we go back to the VMware layer or our deployment template.

    If it doesn't work on the physical box then it is a Microsoft issue and we may have to bring them in.

    I really appreciate your insight and do look forward to seeing you next month in Florida despite what the Twitter me says.

  10. LOL - "the twitter me"

    Ditto my friend.

  11. For those keeping track, we set up a physical 2008 R2 server and it has the same problems. We are now going to test connecting it directly to the internet and bypassing everything except our upstream connection hardware to see if we can narrow things down further.

  12. Looks like this is a firewall issue but not an obvious one. SonicWall has confirmed it is something on their end but they are not sure what. Could be firmware related but they want to confirm that in the lab before assuring me that a firmware update will solve the problem.

    Everyone agrees that there does not appear to be anything wrong with our current configuration which makes firmware the likely suspect but I'm glad they are confirming.

  13. Fixed! We finally got it fixed! Turns out it was a SonicWall firmware issue but nothing we could detect. All of our tests said it was working and SonicWall admitted that we would see that because their firmware was dropping the return UDP packets that it deemed too large and then not logging that it was dropping said packet.

    SonicWall support helped get our firmware fixed and admitted the problem was on their end. We were running an out-of-band firmware release to solve problems with our SonicPoint N devices. This issue fortunately is limited to that release alone and since so few users are running that version we were the first to report the problem.

    Five days later, an issue we could not detect or prove existed is finally resolved. Now to clean up the mess we made along the way.

  14. Hoooray! If it was large UDP packets, then that points back to EDNS/DNSSEC. Very glad you got it sorted out.

    I had that same problem if you recall with my Cisco ASA firewalls until I changed the global policies, etc.

    Good work dude. Good work.

    --DW

  15. I have the exact same problem you mention with a VMware 2008 R2 DC. The physical 2003 boxes are fine. I have disabled EDNS but the problem still persists. I spoke to the firewall engineer and they claim the FortiGate is not the problem. Ahhhhh, where do I start?

  16. Fortunately SonicWall was willing to work on the problem and admit it was their fault. They proved it in their lab. This is where you find out the true nature of a hardware provider.

  17. Hi Jonathan,
    I'm using a Netgear FVX538 firewall and I'm having the exact same problem. What is the model of the firewall you are using?
    Thanks

  18. We are using a SonicWall E5500 series in HA configuration.

  19. Hi,
    I struggled with this problem after installing Windows 2008 R2 SP1. Then I finally found your blog post here, and I think I've fixed my problem. You mentioned large UDP packets. I have a DrayTek Vigor 2920, which has DoS protection with an option for enabling "UDP flood defense". I disabled it, and although I haven't done deep testing, I don't have any time-outs right now.

    Regards, and thanks.

  20. I'm glad my post helped you out. You will have to check with DrayTek regarding how it passes those packets, but I don't think that DoS protection or UDP flood defense will have anything to do with it, as the issue revolves around the router being able to handle those packets.

  21. I was having the EXACT same issue. I created a 2008 R2 server from a template, and dcpromo went without incident. The DNS BPA showed failures for all root servers; inside DNS was OK.
    I demoted this server and deleted it, then recreated the server the same way... same problem.
    After reading this article, I did a reload on the PIX 506E. FIXED IT!!!!

    THANK YOU!!!

  22. You are most welcome. I always benefit when others post these kinds of issues so I try to do the same when I stumble across them hoping it will save someone else time.
