So, just for the sake of it, I've been trying to get my lab to be HA (or as HA as a small homelab can be).
My current setup is as follows:
3 Proxmox servers with some Debian VMs; the VMs run Docker Swarm
A NAS running TrueNAS
ISP router -> OpenWRT router -> VM [port forwards 80/443]
This works like a charm when I'm on my LAN, but when accessing from outside, if the VM that 80/443 are forwarded to is down (which it never is), I'd lose connectivity.
I have no idea how to solve this little problem in an efficient way. Maybe a reverse proxy running on my OpenWRT? (That would only move the point of failure to my router, but if my router goes down it's game over already anyway.) Has anyone attempted this?
Any opinions/ideas?
Set up a VPN to your router as a backup. If something goes down, you can still vpn into your LAN and reach all services.
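For example, a minimal WireGuard tunnel into the router covers this, and OpenWRT can terminate WireGuard natively. A sketch, with every key, name, and address below being a placeholder:

```
# Client-side /etc/wireguard/wg0.conf -- minimal sketch, all values are placeholders
[Interface]
PrivateKey = <client-private-key>
Address = 10.0.100.2/24

[Peer]
PublicKey = <router-public-key>
Endpoint = home.example.com:51820   # your router's public address or DDNS name
AllowedIPs = 192.168.1.0/24         # route only your LAN subnet through the tunnel
PersistentKeepalive = 25            # keep NAT mappings alive
```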
The way I handle this is to have two VMs running in separate hosts, each running my reverse proxy along with keepalived. I resolve my subdomains to the keepalived shared address and then keep the reverse proxy config in git with a cron job to pull updates.
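A sketch of what that cron-driven pull can look like, assuming Nginx as the proxy (the commenter doesn't say which one they run) and hypothetical paths:

```
#!/bin/sh
# sync-proxy.sh -- run from cron on each proxy VM,
# e.g. */5 * * * * /usr/local/bin/sync-proxy.sh
cd /etc/nginx/conf.d || exit 1
git pull --ff-only || exit 1
# only reload if the updated config actually validates
nginx -t && systemctl reload nginx
```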
You're discovering that there's ALWAYS a single point of failure. Even if every service is fault tolerant, you likely have a single network or power infrastructure. So you have to figure out what you're willing to tolerate. You could look into CARP or keepalived to make your reverse proxy more resilient. It's probably overkill for a homelab, but it could be a useful learning exercise.
Our new dog chewed up the Ethernet cable from my modem to my router while I was at work (well, commuting to) the other day. She found the only exposed 6 inches of it and went to town. Everything runs through the router. I had also just re-done some music library file structures and reset my downloaded songs right before leaving, assuming it would queue up and fill up the cache as I went about my day. Something I hadn’t done for over two years, but I wanted a music library so we could put calming music on for the pup that wouldn’t end up in my carefully curated library.
I have my music app set to pre-cache 10 songs, and ended up with 12 songs downloaded, so somewhere around 5-10 minutes after I started playing music on my commute was when the tasty cable was discovered. That was an excruciating day, listening to the same 12 songs over and over again.
Lesson learned about single points of failure in a new way. The worst part was I got a message about it from my fiancé when I got to work, so I knew what happened and there was nothing I could do about it. I just got to look at the world’s strongest firewall all day long.
Our new dog chewed up the Ethernet cable
Ugh! I had some of the same issues a while ago with a Jack Russell I adopted. Cool dog, high-octane energy, eager to learn new things. Since he was teething, everything became a chew toy regardless of the mountain of chew toys I had already provided. USB cables, Ethernet cables, power cords; I've replaced a bunch. The thing about a Jack Russell is you can teach them anything, and they're eager to learn and please; however, if they pick up a bad habit, it's hard to break them of it. He doesn't chew anything anymore, but for a stint there, he was hell on wheels.
I remember a TV station I worked at that had a lot of good redundancy, with 3 redundant UPSes that could keep a bunch of equipment on air until the big generator took over. One day the UPS controller died and took all 3 UPSes out. I think it took the engineers a couple of days to get everything back up and running.
I definitely went down the rabbit hole without any idea how many of these single points I'd need to address, and the more I mitigate, the more I find. Like you said, this is very much overkill; I'm just doing it to learn and have some good old homelab fun before we're all forced to rent "cloud" PCs.
Thanks for the suggestions, I’ll look into those!
If you ask me, you're better off focusing on monitoring, fast detection, and auto-healing in a homelab rather than high availability. I use an ancient tool called monit and newer tools like Uptime Kuma for this. Detection and restart is easier than having two of everything.
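For illustration, a monit check along these lines restarts a proxy when its HTTP port stops answering (a sketch; the nginx service name and paths are assumptions):

```
# /etc/monit/conf.d/nginx -- hypothetical example
check process nginx with pidfile /run/nginx.pid
  start program = "/usr/bin/systemctl start nginx"
  stop program  = "/usr/bin/systemctl stop nginx"
  if failed port 80 protocol http for 2 cycles then restart
  if 5 restarts within 10 cycles then alert
```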
I have a pfSense router and run HAProxy on it. Most of my services run on 3 VMs in a Docker Swarm. HAProxy can point to all three and just uses the first to respond. I think this is what you're going for. I haven't tested how robust this solution is, because my primary motivation was wanting to play with Docker Swarm once I accepted K8s was not worth the effort.
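For reference, "uses the first to respond" maps roughly to HAProxy's first balancing algorithm. A backend along these lines would do it (a sketch with placeholder names and IPs; pfSense generates the equivalent from its GUI):

```
backend swarm_http
    mode http
    balance first                     # always use the first available server
    option httpchk GET /
    server node1 192.168.1.11:80 check
    server node2 192.168.1.12:80 check
    server node3 192.168.1.13:80 check
```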
Focus more on why the service is going down, and solve for that. Make it reliable by restarting automatically in the face of failures. A reverse proxy should be dead simple and shouldn't change state between restarts, so it shouldn't be dying in the first place. Having it restart on failures should be simple and reliable.
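As a concrete example, a systemd drop-in like this gets you automatic restarts (a generic sketch; the unit name is hypothetical and not tied to any particular proxy):

```
# /etc/systemd/system/myproxy.service.d/override.conf -- hypothetical unit name
[Unit]
StartLimitIntervalSec=0   # never stop retrying

[Service]
Restart=on-failure        # restart whenever the process exits abnormally
RestartSec=2
```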
It never goes down :-) I just want to make it better.
Alright, I'm kinda lying; it used to go down all the time because my NICs would hang, but I did fix it. That problem did give me the itch to make it even more available.
There's only so much reliability you can build into a simple home setup without it becoming a poor return on investment. In a datacenter situation, you'd have fault tolerance on all the network ingress: load balancers, bonded interfaces, SD-WAN configurations, etc.
Unless you want 3 of everything you own, just do the basics, OR I guess consider hosting it elsewhere 🤣
You're talking about high availability design. As someone else said, there's almost always a single point of failure, but there are ways to mitigate depending on the failures you want to protect against and how much tolerance you have for recovery time. Instant/transparent recovery IS possible, you just have to think through your failure and recovery tree.
Proxy failures are kinda the simplest to handle if you're assuming the backends (storage/compute/network connectivity) are out of scope. You set up two (or more) separate VMs that have the same configuration and float a virtual IP between them that your port forwards point to. If one VM goes down, the VIP migrates to whichever VM is still up and your clients never know the difference. Look up keepalived; that's the standard way to do it on Linux (sketch below).
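A minimal keepalived sketch for the primary VM looks like this; the interface, password, and addresses are placeholders, and the second VM uses state BACKUP with a lower priority:

```
# /etc/keepalived/keepalived.conf on the primary proxy VM
vrrp_instance proxy_vip {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150              # the BACKUP node gets e.g. 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass s3cr3t
    }
    virtual_ipaddress {
        192.168.1.250/24      # the VIP your port forwards point at
    }
}
```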
But then you start down a rabbit hole. Is your storage redundant? The network connectivity? Power? All of those can be made redundant too, but it will cost you time, and likely money for hardware. It's all doable; you just have to decide how much it's worth to you.
Most homelabbers, I suspect, will just accept the 5 minutes it takes to reboot a VM and call it a day. Short downtime is easier to handle, but there are definitely ways to make your home setup fully redundant and highly available. At least until a meteor hits your house, anyway.
The more I go down this rabbit hole, the more I understand this, and I realise now that I went into the hole with practically zero knowledge of the topic. It was so frustrating to get my "HA" proxy working on LAN with replicated containers, DNS, and shared storage, hours sunk into getting permissions to work, just to realise "oh god, this only works on LAN" when my certs failed to renew.
I don't think I really need this; truth is the lab is in a state where most things I want [need] are working very well, and this is a fun nice-to-have to learn some new things.
Thanks for the info! I will look into it!
IIRC there are a couple of different ways with Caddy to replicate the Let's Encrypt config between instances, but I never quite got that working. I didn't find a ton of value in an HA reverse proxy config anyway, since almost all of my services run on the same machine, and usually the proxy is offline because that machine is offline. The more important thing was HA DNS, and I got that working pretty well with keepalived. The redundant DNS server just runs on a $100 mini PC. Works well enough for me.
Like you’re thinking: put HAProxy on your OpenWRT router.
That's what I do. The HAProxy setup is kind of "dumb", L4 only (plain TCP rather than HTTP/S), since I wanted all of my logic in the Nginx services. The main thing HAProxy does is, like you're looking for, put the SPOF alongside the other unavoidable SPOF (the router), and it also wraps the requests in Proxy Protocol so the downstream Nginx services will have the correct client IP.
Flow is basically:
LAN/WAN/VPN -> HAProxy -> two Nginx instances -> apps

With HAProxy in the router, it also lets me set internal DNS records for my apps pointing to my router's LAN IP.
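A sketch of that passthrough setup, with placeholder addresses (send-proxy-v2 is the flag that wraps connections in Proxy Protocol):

```
# haproxy.cfg fragment on the router
frontend fe_https
    bind :443
    mode tcp                  # dumb L4 passthrough; TLS terminates at Nginx
    default_backend be_nginx

backend be_nginx
    mode tcp
    server nginx1 192.168.1.21:443 send-proxy-v2 check
    server nginx2 192.168.1.22:443 send-proxy-v2 check

# and on the Nginx side, accept the PROXY header so client IPs survive:
#   listen 443 ssl proxy_protocol;
#   set_real_ip_from 192.168.1.1;       # the router
#   real_ip_header proxy_protocol;
```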
Thanks for the info. I kinda wanted someone to confirm that it could be done before I sank some hours into it :)
Sure, big orgs do it all the time. A pair of load balancers with virtual IPs that route traffic, either to a reverse proxy or right to the endpoint.
But your router is still a SPOF, as is your ISP.
How much availability is worth the time spent setting up and maintaining this, though?
My setup is a bit different, but maybe you can reuse part of it. Instead of using Swarm for HA, I'm using Proxmox. The LXCs fail over to another machine (if one goes down) and have static IP addresses, so if the HTTP proxy machine goes down, it boots back up on another machine with the same IP (and thus a working port forward).
This does mean I have to keep the configuration in sync between the different machines so my RPO never gets too big, but for something like NPM (Nginx Proxy Manager), where the config rarely changes, this isn't much of an issue.
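For anyone wanting to replicate this, the Proxmox side is roughly the built-in HA manager. A sketch with placeholder IDs and node names:

```
# define which nodes may run the proxy container, then register it as an HA resource
ha-manager groupadd proxynodes --nodes "pve1,pve2,pve3"
ha-manager add ct:105 --group proxynodes --state started
# after a node failure, the CT is restarted on another node in the group;
# with a static IP inside the CT, the router's port forward keeps working
# (assumes the CT's disk lives on shared or replicated storage)
```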
Super, thanks. I will look into this. I actually have some shared storage on my NAS for Traefik configs, which is already working with replicated instances.