HACKER Q&A
📣 leo8

DNS redundancy, how to do it right?


I hope I'm doing this right, since it's my first submission. It's a question directed at the sysadmins of HN:

How do I achieve nameserver redundancy?

Right now our provider is getting DDoS'ed, so my employer is not reachable by mail, web, etc. If I do a whois on the affected domain, I get multiple nameservers (all owned by the provider).

Looks like this:

nserver ns01.provider.tld
nserver ns02.provider.tld
nserver ns03.provider.tld
nserver ns04.provider.tld
nserver ns05.provider.tld

Actually two questions arise from this:

- Is it a good idea to set up my own nameserver which basically just "copies" the entries from my current provider, and list it (wherever it may be hosted)? That way I won't have to maintain 2 different nameservers, only the one from the provider, since the 'secondary' will simply be a copy of the primary.

- Is it a good idea to simply increase the TTL of the important A/MX records? Will, for example, 1.1.1.1 still resolve my domain correctly even if my provider's nameservers are down for an hour? (assuming I have a TTL of 3 hours, for example)

Thankfully, I'm not the CTO, but since he mentioned to me that this happens regularly to the provider (being DDoSed), it got me really curious about what the right mitigation for being unreachable is.


  👤 elp Accepted Answer ✓
My $dayjob is at a domain registry operator.

Find out if your provider will allow you to add your own nameservers and allow zone transfers to them. Most will, but check, because you REALLY don't want to synchronize changes manually.

You don't need anything fancy; add one or two of your own nameservers. Something at Hetzner, OVH, DO, AWS, etc. is fine. You only need a small, basic Linux box with Knot, NSD, or BIND installed and a gig or 2 of memory.

Don't worry about PowerDNS if you only have a couple of domains. It's not worth the extra setup for a secondary then.

Make sure you do not allow recursive queries from the rest of the world (in BIND) and make sure you turn on rate limiting to be safe. As a first step that will really help. Obviously, longer term you want to move the nameservers away from your current provider and either outsource the management or set up the primary yourself. The general rule we recommend is at least 3 nameservers, preferably on multiple continents.
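For reference, a minimal sketch of what such a small box might look like in BIND, assuming the provider's primary transfers the zone from a placeholder address 192.0.2.10 and the zone is example.com (both hypothetical):

    // named.conf sketch for a small authoritative-only secondary (names/addresses are placeholders)
    options {
        directory "/var/cache/bind";
        recursion no;                   // authoritative only; don't resolve for the world
        allow-transfer { none; };       // don't let strangers pull the zone off this box
        rate-limit {
            responses-per-second 10;    // basic response-rate limiting (RRL)
        };
    };

    zone "example.com" {
        type secondary;                 // "type slave" on older BIND versions
        primaries { 192.0.2.10; };      // "masters { ... };" on older BIND versions
        file "db.example.com";
    };

Knot or NSD need roughly the same three pieces of information: who the primary is, where to store the zone, and who is allowed to query or transfer it.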

This setup is very robust and will stop everything short of a serious DDoS attack. If that's a real concern, then you need to outsource to a specialist. I like and have used netnod.se and Packet Clearing House (https://www.pch.net/), but they are very much not free.

If you are going to do all the DNS yourself then PowerDNS is great, but get someone local with DNS skills to give you real advice.


👤 oneplane
Unless it is your core business, you pay companies like Cloudflare to do it for you.

If it is your core business, but you are also big enough: you also pay Cloudflare or companies like them to do it for you.

If you are in between those two: as long as your nameservers can be found, they can also be DoS'ed. But if they can't be found, they also can't resolve anything. And now you're getting into the true problem: the bigger pipe tends to win, and if the bigger pipe has more origin ASNs and IPs to bug you with, individually black-holing them won't be feasible either. So now you need to have a 'bigger pipe', and that's not something most companies want to, or can, invest in.


👤 c0l0
At $oldjob, we once got an extortion email of dubious credibility from an unidentifiable party, claiming they would DDoS our infrastructure (we were a lucrative commercial web enterprise with lots of daily users) if we refused to pay 100 BTC (the BTC<->USD rate was a lot lower back then than it is now ;)). As the infrastructure lead, I used this as an opportunity (albeit in a bit of a hurry ;)) to strengthen our resilience against this kind of threat, also on the DNS level.

The pair of authoritative nameservers, which we were self-hosting at the colo space we rented, was based on PowerDNS with a replicated PostgreSQL database behind it. A "shadow master" PostgreSQL instance was where control over zone data was exercised, and that instance used streaming replication to shuttle its dataset to read-only secondaries over a purpose-specific SSH tunnel (nowadays we'd probably be using WireGuard instead). There, PowerDNS authoritative server instances picked up the zone data from the host-local Postgres databases and served it up by means of DNS.

This setup proved very easily extensible (spinning up a new, additional secondary was a matter of a few minutes via a simple Ansible playbook that set up a new SSH tunnel, a Postgres hot standby, and a new pdns instance that drew its zone data from the local Postgres instance), and we chose to deploy two additional nameservers at dedicated server providers in nearby Europe to host our tertiary and quaternary authoritative DNS servers. The only remaining, but tedious, task was updating all the glue records for the domains we handled on these nameservers.

In the end, the entire threat proved hollow, as the deadline passed with zero impact on any of our infra. We never learnt if it was just empty to begin with, or if the adversary decided not to bother attacking a visibly well-prepared site. But the resilience-improved DNS infrastructure was a nice thing to be able to rely on in the coming years, and I think pretty much the same architecture/setup is still in operation to this day.


👤 justizin
Use a large provider's DNS. I can't suggest supporting CloudFlare because of their dubious selective enforcement of ToS regarding sites that actively organize and encourage harm to people, but there are much lighter weight alternatives.

AWS Route 53 is basically free and doesn't require you to use any other AWS services; the same is probably true for any other cloud provider. Smaller providers like DigitalOcean / Linode / whatever should also be fine - I use DO for personal stuff, but would happily use it for larger-capacity projects. Many cloud providers' DNS APIs are supported in Terraform, so you don't have to worry about what the UI is like.

Your DNS registrar also probably offers this service.

I will say this: if your provider is unable to protect their own DNS service, you should find a better provider. While CloudFlare and other similar services have incredibly resilient DNS, most folks don't need that. Anyone who is in the business of hosting online services should be capable of running a resilient DNS service. If not, you have to ask yourself how resilient anything else they offer you is.

While there are lots of ways to screw it up, DNS is incredibly simple compared to basically any other service, reliability-wise. BIND on any reasonable hardware can handle katrillions of queries.


👤 toast0
If all of your records are traditional static records, DNS redundancy is pretty simple. DNS includes AXFR for secondary servers to pull the zone data from the primary. You can set up your own primary and use commercial services as secondary, or use one service as primary and another as secondary, etc. If the primary is DDoSed, you might not be able to make updates if the secondaries can't connect for AXFR, but often that happens on different IPs than the public service IPs, so there's a chance it still works.
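A quick way to check whether this will even work with the current provider (the zone and server names below are placeholders, not taken from the original question):

    # Ask the provider's primary for a full zone transfer; if this is refused,
    # they'll need to add your secondary's IP to their transfer ACL first.
    dig @ns01.provider.tld example.com AXFR

    # Compare SOA serials on the primary and on your own secondary to confirm they stay in sync.
    dig @ns01.provider.tld example.com SOA +short
    dig @your-secondary.example.net example.com SOA +short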

But, it gets a lot trickier if you use any sort of DNS-based load balancing or other trickery. Then you've got to set up both services as close to the same as you can and cross your fingers; there are tools out there for that - I think Terraform can do it, but there are more focused tools as well. This is a good practice, but it's hard, so it's usually not done.

Top tier DNS services rarely get (successfully) DDoSed or have other service outages, but it does happen.


👤 bwoodcock
Hi. I'm with Packet Clearing House, which @elp mentioned. I would second all of their technical advice, but note that PCH is a public-benefit non-profit, so it exists to provide service at no cost to governments (ccTLDs) and critical infrastructure operators (mostly IXPs and CERTs) but, as required by the IRS, charges market rate to for-profit private-benefit organizations.

A few additional notes:

- You should keep what's (un-politically-correctly) generally referred to as a "hidden master" for your zone data on a machine somewhere that won't be targeted by a DDoS aimed at you or your ISP, and have an ACL that only permits zone transfers to your authorized secondary authoritative servers (a rough sketch of that follows this list).

- You should probably get a few other organizations to act as public-facing authoritative servers for you, so all your authoritatives don't share any avoidable common failure modes. Different people administering them, different technology stacks on different hardware in different places.

- For servers you run, consider running DNSdist in front of them. It's a DNS load balancer which has very efficient internal caching, and which will allow you to answer a lot more queries per core than a full-fledged nameserver would. Run it in front, even on the same machine, to get more bang for your buck.

- A high TTL will indeed help a lot with DDoS against your nameservers (since everyone will cache answers rather than being dependent on getting a live connection to your nameservers). But it will also make you less nimble in responding to a DDoS against your actual content servers, since you won't be able to move them quickly to a different provider. I tend to favor high TTLs, but reasonable people support both sides of that argument.
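To make the first bullet a bit more concrete, here is a rough BIND-flavored sketch of the transfer ACL on such a hidden master; the addresses are placeholders standing in for your public-facing secondaries:

    // On the hidden master, which is never listed in the zone's NS records
    acl "my-secondaries" { 192.0.2.53; 198.51.100.53; 203.0.113.53; };

    options {
        notify yes;                                 // push NOTIFYs so secondaries re-transfer promptly
        allow-transfer { my-secondaries; };         // only the authorized public servers may AXFR
        allow-query { my-secondaries; localhost; }; // keep the box invisible to everyone else
    };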


👤 SadWebDeveloper
Are you getting "DDoS'ed" on the DNS server or on the HTTP server?

Usually the latter is more common, since DNS server software tends to be quite robust at handling heavy traffic (and the protocol is _lighter_ than HTTP).

Anyway, you should separate the DNS server from the web server. For this particular case I personally recommend cloud DNS providers like AWS Route 53; by default they give you 4 different geo-located points and provide an API if you want to fight back against the DDoS'ers (by changing your DNS records to 127.0.0.1 or 255.255.255.255 for a short period of time). Usually this solves the email issue.
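For what it's worth, the kind of API-driven record change being described looks roughly like this with the AWS CLI; the hosted zone ID and record name here are made-up placeholders:

    # Hypothetical example: temporarily point a record at 127.0.0.1 through the Route 53 API.
    aws route53 change-resource-record-sets \
      --hosted-zone-id Z0000000EXAMPLE \
      --change-batch '{
        "Changes": [{
          "Action": "UPSERT",
          "ResourceRecordSet": {
            "Name": "www.example.com",
            "Type": "A",
            "TTL": 60,
            "ResourceRecords": [{"Value": "127.0.0.1"}]
          }
        }]
      }'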

As for the web server, this is tough; a well-placed DDoS won't be stopped. Even Cloudflare has been hit with huge attacks they couldn't handle (despite what their PR department says; the fact that even a bigger network than Cloudflare, like Akamai, couldn't protect Brian Krebs tells you a lot about how tough this space is). The best way is null-routing the bad actors and spreading out different ways to access your services, like asking customers to go to frontendXYZ.mydomain.tld.


👤 johnklos
"Right now our provider is getting DdoS'ed, so my employer is not reachable by mail, web etc."

"our" here might suggest that your provider is also your employer's provider, and that your employer is not reachable in general by anyone by mail, web, et cetera. But reading the rest of your message makes me think that perhaps you're saying that your personal provider is being attacked, not your employer's, and therefore you can't reach your employer's mail, web, et cetera. Is that the case?

Is the attack just taking out your provider's DNS servers? If so, then just run your own recursive resolver. It's literally as easy as setting up BIND on any machine on your network with a default configuration file that does the barest minimum. Clients on the same subnet will be able to query it without problems.
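For reference, the "barest minimum" being described is roughly this much named.conf; the 192.0.2.0/24 office subnet below is a placeholder, not something from the original question:

    // Minimal recursive resolver for one internal subnet (addresses are placeholders)
    options {
        directory "/var/cache/bind";
        recursion yes;
        allow-query     { 127.0.0.0/8; 192.0.2.0/24; };  // only answer our own clients
        allow-recursion { 127.0.0.0/8; 192.0.2.0/24; };  // never recurse for outsiders
    };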

There's no reason, nor advantage, to running anything through your provider's DNS servers when you have your own, particularly when theirs can be taken down so easily.


👤 Sevan777
The general advice is that you should have more than one nameserver, and they should be on different networks/servers, so that a DDoS on one network doesn't cause an entire namespace outage - the exact issue you're suffering. You are right to think about increasing the TTL of anything that's static and constant, like MX records. That TTL value buys time during an outage before the problem propagates to hosts that need to query your records, since the answers are still cached at their end. Another trick is to have the authoritative server where you perform the updates not referenced in your domain's NS records, and instead only list the secondary (replica) servers there. That way you maintain control, any attack based on NS records hits only the replicas, and they refetch the zone from the authoritative server periodically (based on SOA record settings).
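Roughly what that looks like at the zone apex (every name here is a placeholder): only the replicas are advertised in NS records, and the SOA timers control how often they re-check the hidden primary, whose address is configured on the replicas themselves rather than published:

    ; Apex of example.com - only the public replicas appear in NS records.
    example.com.  86400  IN  SOA  ns1.replica-provider.example. hostmaster.example.com. (
                                  2024102001  ; serial - bump on every change
                                  3600        ; refresh - replicas re-check the primary hourly
                                  900         ; retry - retry interval if the primary is unreachable
                                  1209600     ; expire - replicas keep serving ~2 weeks without contact
                                  300 )       ; negative-caching TTL
    example.com.  86400  IN  NS   ns1.replica-provider.example.
    example.com.  86400  IN  NS   ns2.other-provider.example.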

👤 nilmask
You want multiple NS records containing hostnames of nameservers from two or more providers on independent infrastructure (i.e. not two providers that are both hosted on, for example, AWS' compute).

I would advise against running your own nameserver unless you have confidence in your ability to operate it correctly.

You can increase the TTLs if you don't anticipate record data changing frequently, or are able to tolerate delays in your DNS record changes being served (until the cached answers expire).

Choice of resolver (e.g. 1.1.1.1, 8.8.4.4) is out of your control (except of course, on your own devices and machines). Increasing the TTLs may improve robustness, assuming that your clients' resolvers are well-behaved and respect TTLs [0].

[0] https://www.ctrl.blog/entry/dns-client-ttl.html


👤 gmuslera
You can have multiple secondary servers; what is not a good idea is to have just one, because if anything happens to it, things may fail elsewhere (e.g. mail getting lost, which may be worse than losing connectivity to your website for a while). And once the zone is set up and the secondary is working, you only need to modify things on the primary.

Regarding TTLs: how frequently do you modify those records, or how likely is it that you'll have to modify them in the near future? Those times tell resolvers how long they may cache that information, and set the propagation time for any change. With current bandwidths, and with secondaries in place, you can have relatively frequent updates, but leave some margin so remote resolvers keep meaningful caches and access to your sites stays fast.


👤 sgjohnson
> Will for example, 1.1.1.1 still resolve my domain correctly, even if my providers nameserver is down for an hour?

1.1.1.1 is virtually impossible to DDoS, because it’s anycasted in _a lot_ of places, and Cloudflare has the capacity to mitigate the largest of DDoS attacks.

> - Is it a good idea to setup my own nameserver which basically just "copies" the entries from my current provider and specify it (wherever that may be). By doing this I won't have to maintain 2 different NS, only the one from the provider since the 'secondary' will simply be a copy of the primary?

This would be a caching NS. It's not a bad idea, especially if it can automatically fall back to a different service when one is down, but you might as well just use 1.1.1.1. I've never seen 1.1.1.1 down.


👤 sredevops
In this case, the vendor probably needs to be replaced. That's the easy fix. A higher TTL will help, as resolvers won't be querying them as often, but then making changes will be a pain in the future unless you pre-plan.

These are some things I have thought about for managing DNS. Any other recommendations would be great.

1) Separate Registrar from the source of the DNS records

2) Multiple NS records on the registrar

3) Multiple vendors for DNS records with Terraform sync between the two vendors

4) Limit A records or point A Records to CNAMES

5) Low TTLs for NS records - risky though if you are hacked or make a bad change


👤 m3047
Simple answer:

1) Spend money.

2) Get a different provider.

Beyond that, get free help or advice. I guess that's what you're attempting now.

Let's start with spending money. You need to convince somebody that it's worth it. (Know your assets, know your risks, etc.) If it's not worth it or it's killing the company, and they won't do anything about it except to send you out to beg for help: get another job. Seriously.

As for the provider and "DOS": put their balls in a vice. What kind of DOS, exactly? What mitigations do they employ, exactly? Are they visibly seeking advice and assistance? Where's their outage page? I (charitably) assume you're being coy because you don't know... not because you're just being coy. Get the facts; cache la poudre; name and shame them.

Who else is affected? Band together, share notes and intelligence. Openly. Fully. Go read some of the DNS server mailing lists and dns-ops. If you can't swim in those waters, go home. Hide, and hope they go away.

Most of the answers here are akin to poking a dead beached whale: "smells bad!" "look, there's its liver!" "that's the blowhole, that's how it breathes" "looks like a propeller strike": factual, but not gonna help the whale.

I'm baffled by the premise of your question: exactly how does this lead to needing to do "redundancy" correctly? Is the provider not doing it correctly? No evidence is provided to support the assertion.

Reachability and services: there are a lot of tactical measures depending on how services are hosted. Mirroring the domain is a tool in the arsenal, depending on your line of business and communications needs (monitoring a SCADA system for emails to make sure the nuke doesn't melt down is different from some rando wanting to return a party dress).

"multiple nameservers": anycast is a thing.

Ummm... mirroring the domain ("just 'copies'": WTF?) IS maintaining an NS.

I'm not going to assume anything about your TTL. Name the domain. Tell me the TTL. Let me confirm it. ("simply increase the TTL": WTF? I think "simple" is the important word there. None of this is simple.)

--

m3047 | FWM6, internet plumber


👤 adql
Pay someone.

But if you really want to: PowerDNS + MySQL replication for the "slaves", then put the slaves in a few different places with good bandwidth.

Instant updates, very bulletproof.


👤 LinuxBender
> Is it a good idea to setup my own nameserver which basically just "copies" the entries from my current provider and specify it (wherever that may be). By doing this I won't have to maintain 2 different NS, only the one from the provider since the 'secondary' will simply be a copy of the primary?

Do you mean authoritative secondary replicas? If so, that is not uncommon. If your DNS provider is being targeted and your company is not, then your DNS servers will still respond and a percentage of clients will try them. While root servers allow 10 records, anything beyond 4 will become less useful, as different resolvers cap the number of NS records they will try. You can look up which OS/resolver has what behavior and then make a decision based on what OS most of your customers use. Amazon, for example, uses batches of 5 anycast records. If your commercial DNS provider is also anycast, then one could use 2 of those records and 2 of your own company-hosted DNS servers just fine.

Look into how your commercial DNS provider handles zone transfers, then set up a couple of decent servers that pull in all the zones you want redundancy for. Just know there is no concept of priority, meaning the order they are listed in at the root servers does not matter. Whatever servers you set up will need to take a percentage of the traffic your commercial provider is absorbing. If doing this on a VPS provider, I would suggest Vultr, as they support anycast, meaning you can spin up many VMs to handle the load and still have only a couple of public IP addresses without any load-balancer bottlenecks.

> Is it a good idea to simply increase the TTL of the important A/MX-Records? Will for example, 1.1.1.1 still resolve my domain correctly, even if my providers nameserver is down for an hour? (assumed I have a TTL of 3 hours for example)

There are pros and cons to high TTLs depending on how your organization handles changes, failovers, etc. There are some discussions on the web about these pros and cons, too many to name here. It is also important to understand how clients actually cache high TTLs. For example, some clients will cap NS TTLs to 86400 seconds regardless of how high they are, and some clients will cap A TTLs to 1 or 3 days. Then there is the factor of recursive-server memory and end users. ISP caches will expire records much faster regardless of TTL due to memory pressure. Each ISP and public DNS server handles this a little differently. So a high TTL can sometimes help, assuming your infrastructure does not depend on being able to fail things over fast and that you are not planning on changing MX endpoints. This requires some foresight into how one architects their infrastructure to fully realize the benefits of a higher TTL without incurring operational risk.

I am testing 1.1.1.1 right now and it took many requests to finally get my records cached on all their nodes, so if your domain is popular enough they may be useful.

I suppose that was a long-winded way of saying, "It depends". You should meet with your infrastructure team and think through what systems depend on having a low TTL and keep those low. For anything else a higher TTL is probably fine.

[Edit] It sounds like maybe you were just asking about recursive servers so most of this doesn't even apply.

Some places to browse for more detailed answers would be StackExchange [1], ServerFault [2], and SuperUser [3]. Just be sure to lurk a long time before asking questions. They are particular about how questions are formatted, how on-topic they are for the particular forum, and whether one has done an exhaustive search for existing answers.

[1] - https://unix.stackexchange.com/

[2] - https://serverfault.com/

[3] - https://superuser.com/


👤 matheweis
> Right now our provider is getting DDoS'ed, so my employer is not reachable.

> It got me really curious what the right mitigation to being unreachable is.

Sometimes what can be done technically and the best solution are not always the same.

Is there a reason that you can’t simply move to a better provider without resorting to more advanced DNS infrastructure?

Cloudflare, for example, is sort of the gold standard for DDoS protection, and they also offer excellent DNS services.

There are other major DNS service providers with excellent reputations as well; AWS and DNS Made Easy come to mind.