Notable DNS Outage Incidents
Major DNS failures that broke the internet and lessons learned
DNS is the phone book of the internet, which means outages have enormous impact. From the 2016 Dyn DDoS attack to the 2021 Facebook BGP incident, we examine real-world large-scale DNS failures to understand why DNS redundancy and monitoring are essential.
Architecture Diagram
Major DNS/Infrastructure Outage Timeline (2016-2023)
2016.10
Dyn DDoS (Mirai Botnet)
Severity: Critical
Root Cause: Mirai botnet used roughly 100K compromised IoT devices for a DDoS attack reported at up to 1.2 Tbps
Affected Services: Twitter, GitHub, Netflix, Reddit, Spotify, PayPal
Duration: ~6 hours (3 waves of attacks)
2019.06
Cloudflare BGP Route Leak
Severity: High
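Root Cause: A BGP optimizer at small ISP DQE generated more-specific routes that leaked through a customer to Verizon, which propagated them without filtering and pulled Cloudflare traffic onto the wrong path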
Affected Services: Cloudflare, Amazon, Linode, Discord
Duration: ~2 hours
2021.06
Fastly Config Error
Severity: Medium
Root Cause: A valid customer config change triggered a latent software bug, causing 85% of Fastly's network to return errors
Affected Services: Amazon, Reddit, Twitch, The New York Times, UK Gov
Duration: 49 minutes
2021.10
Facebook BGP Incident
Severity: Critical
Root Cause: A configuration change during routine maintenance disconnected Facebook's backbone, so its DNS servers withdrew their BGP routes and disappeared from the internet
Affected Services: Facebook, Instagram, WhatsApp, Messenger, Oculus
Duration: ~6 hours
2022.06
Cloudflare BGP (19 DC)
Severity: High
Root Cause: A network configuration change accidentally withdrew critical BGP prefix advertisements in 19 data centers
Affected Services: Cloudflare, Discord, Shopify, Fitbit
Duration: ~90 minutes
2023.03
.au DNSSEC Expiry
Severity: High
Root Cause: .au TLD DNSSEC signing key (KSK) expired, causing validation failure and making all .au domains inaccessible
Affected Services: All Australian .au domains
Duration: Several hours
DNS Outage Prevention Checklist
1
Use Multiple DNS Providers
Configure 2+ DNS providers (e.g., Route 53 + Cloudflare DNS) to eliminate single point of failure
2
Enable DNSSEC Key Auto-Rotation
Set up KSK/ZSK expiry alerts and auto-rotation. Lesson from the .au incident.
3
BGP Route Monitoring
Use BGP anomaly detection services (e.g., Cloudflare Radar, BGPStream). Defend against route leaks/hijacks.
4
Establish TTL Strategy
Normal TTL 3600s; lower to 300s ahead of planned changes, at least one full normal TTL in advance. A short TTL lets failover take effect quickly during incidents.
5
Config Change Safeguards
Canary deployment, automated rollback, dry-run validation before changes. Lessons from Facebook/Fastly.
6
DDoS Mitigation Plan
Use Anycast DNS, rate limiting, scrubbing center integration. Industry standard since the Dyn incident.
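As an illustration of the expiry alerting in item 2, the sketch below classifies a DNSSEC signature by how close it is to expiring. It assumes the expiration value has already been read from an RRSIG record in the 14-digit `YYYYMMDDHHmmSS` presentation form (UTC); the function names and the 7-day `warn_days` threshold are hypothetical choices for this example, not recommended values.

```python
from datetime import datetime, timedelta, timezone

# 14-digit RRSIG timestamp form (RFC 4034 presentation format), always UTC.
RRSIG_TIME_FORMAT = "%Y%m%d%H%M%S"


def parse_rrsig_time(value: str) -> datetime:
    """Parse an RRSIG inception/expiration field such as '20230401000000'."""
    return datetime.strptime(value, RRSIG_TIME_FORMAT).replace(tzinfo=timezone.utc)


def signature_status(expiration: str, now: datetime, warn_days: int = 7) -> str:
    """Classify a signature as 'ok', 'warning', or 'expired'.

    warn_days is a hypothetical alert threshold chosen for illustration.
    """
    remaining = parse_rrsig_time(expiration) - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= timedelta(days=warn_days):
        return "warning"
    return "ok"


# Fixed "now" so the example is reproducible.
now = datetime(2023, 3, 20, 12, 0, 0, tzinfo=timezone.utc)
print(signature_status("20230401000000", now))  # more than 7 days left
print(signature_status("20230322000000", now))  # under 2 days left
print(signature_status("20230319000000", now))  # already in the past
```

In practice the expiration value would come from querying the zone's RRSIG records on a schedule, and a "warning" result would page whoever owns the signing pipeline.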
Key Points
•
Most DNS outages are caused by "human config errors" or "automation script bugs", so operational process matters more than technical flaws
•
BGP is the core internet routing protocol but has weak authentication; RPKI adoption is progressing
•
Depending on a single provider means their outage is your outage; redundancy is not optional, it is essential
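To make the RPKI point concrete, here is a simplified sketch of route origin validation: an announcement is checked against a list of ROAs (Route Origin Authorizations), each naming a prefix, a maximum prefix length, and an authorized origin AS. The `Roa` type, the `validate_origin` function, and the sample data (documentation prefixes and AS numbers) are hypothetical; real validators handle further cases that are omitted here.

```python
import ipaddress
from typing import NamedTuple, Sequence


class Roa(NamedTuple):
    prefix: str      # authorized prefix, e.g. "203.0.113.0/24"
    max_length: int  # longest prefix length the origin may announce
    asn: int         # authorized origin AS number


def validate_origin(prefix: str, origin_asn: int, roas: Sequence[Roa]) -> str:
    """Return 'valid', 'invalid', or 'not-found' for one announcement."""
    announced = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # subnet_of() raises TypeError across IP versions, so skip mismatches.
        if announced.version != roa_net.version:
            continue
        if not announced.subnet_of(roa_net):
            continue
        covered = True  # at least one ROA covers this prefix
        if roa.asn == origin_asn and announced.prefixlen <= roa.max_length:
            return "valid"
    # Covered by a ROA but no match: wrong origin or too specific.
    return "invalid" if covered else "not-found"


# Hypothetical ROA set using documentation prefixes and AS numbers.
roas = [Roa(prefix="203.0.113.0/24", max_length=24, asn=64500)]

print(validate_origin("203.0.113.0/24", 64500, roas))  # authorized origin and length
print(validate_origin("203.0.113.0/25", 64500, roas))  # more specific than max_length
print(validate_origin("203.0.113.0/24", 64501, roas))  # unauthorized origin AS
print(validate_origin("192.0.2.0/24", 64500, roas))    # no covering ROA
```

A network that drops "invalid" announcements would reject both the unauthorized origin and the overly specific prefix, which is the protection RPKI adds on top of plain BGP.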
How It Works
1
DNS/BGP failure occurs due to external attack (DDoS) or internal misconfiguration
2
DNS lookup fails: domain names cannot be resolved to IPs
3
Impact spreads: all services using the affected DNS/CDN become simultaneously unreachable
4
Root cause identification and recovery (BGP route restoration, DDoS mitigation, config rollback, etc.)
5
Wait for resolver caches to expire so the restored records propagate (minutes to hours, depending on TTL)
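The dependence of recovery time on TTL in step 5 is simple arithmetic, sketched below. It assumes resolvers honor the published TTL, which not all do; the function names and the sample change time are hypothetical, while the 3600s and 300s values are the ones from the checklist.

```python
from datetime import datetime, timedelta, timezone


def latest_stale_answer(change_time: datetime, ttl_seconds: int) -> datetime:
    """Latest moment a TTL-honoring resolver may still serve the old record."""
    return change_time + timedelta(seconds=ttl_seconds)


def lower_ttl_deadline(change_time: datetime, normal_ttl_seconds: int) -> datetime:
    """Latest moment to publish a reduced TTL ahead of a planned change.

    A resolver that cached the record just before the TTL was lowered keeps
    it for up to normal_ttl_seconds, so the reduced TTL has to be live at
    least that long before the change.
    """
    return change_time - timedelta(seconds=normal_ttl_seconds)


# Hypothetical planned change at 12:00 UTC.
change = datetime(2024, 1, 10, 12, 0, 0, tzinfo=timezone.utc)

print(lower_ttl_deadline(change, 3600))   # publish the 300s TTL by this time
print(latest_stale_answer(change, 300))   # old answers gone by this time (TTL 300s)
print(latest_stale_answer(change, 3600))  # versus this time if the TTL stayed at 3600s
```

With the TTL lowered in time, the worst-case window of stale answers shrinks from one hour to five minutes.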
Pros
- ✓ Learning from incidents: proactive preparation from real-world outage lessons
- ✓ Redundancy design: improved availability by eliminating single points of failure
- ✓ Enhanced monitoring: reduced MTTR through early outage detection
- ✓ TTL strategy: pre-adjusting TTL for quick DNS failover during incidents
Cons
- ✗ Increased cost: higher operational costs when using multiple DNS/CDN providers
- ✗ Increased complexity: complex record sync and failover configuration across multiple providers
- ✗ No perfect defense: Mirai-scale DDoS attacks are hard to fully defend against with a single service
- ✗ BGP dependency: even if DNS itself is healthy, BGP routing issues can make it unreachable
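The record sync problem listed under Cons can be reduced by checking for drift between providers. The sketch below compares two zones that have already been exported into plain dictionaries; the `Zone` shape, the `zone_drift` function, and the sample records are hypothetical, and fetching the records from each provider's API is out of scope here.

```python
from typing import Dict, FrozenSet, Tuple

RecordKey = Tuple[str, str]             # (name, type), e.g. ("www.example.com.", "A")
Zone = Dict[RecordKey, FrozenSet[str]]  # record key -> set of record values
Drift = Dict[RecordKey, Tuple[FrozenSet[str], FrozenSet[str]]]


def zone_drift(primary: Zone, secondary: Zone) -> Drift:
    """Return every record whose value set differs between two providers."""
    drift: Drift = {}
    for key in primary.keys() | secondary.keys():
        a = primary.get(key, frozenset())    # empty set means "record missing"
        b = secondary.get(key, frozenset())
        if a != b:
            drift[key] = (a, b)
    return drift


# Hypothetical exports from two DNS providers.
provider_a: Zone = {
    ("www.example.com.", "A"): frozenset({"203.0.113.10"}),
    ("example.com.", "MX"): frozenset({"10 mail.example.com."}),
}
provider_b: Zone = {
    ("www.example.com.", "A"): frozenset({"203.0.113.20"}),  # stale value
    ("example.com.", "MX"): frozenset({"10 mail.example.com."}),
}

for key, (a, b) in zone_drift(provider_a, provider_b).items():
    print(key, sorted(a), sorted(b))
```

Running a check like this on a schedule surfaces mismatches before a failover exposes them to users.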
Use Cases
Multi-DNS provider strategy (Route 53 + Cloudflare in parallel)
DNS outage monitoring system setup (Pingdom, UptimeRobot)
DDoS defense planning (Anycast, Rate Limiting, WAF)
Incident response playbook creation (OOB access, emergency contacts)