How to determine the hostname for absolute links
Links in web applications can be either:
- Relative links like <a href="/foo">, or
- Absolute links like <a href="https://www.site.com/foo">
If you’re writing a web application, relative links are generally the way to go. It’s simpler to move the application to a new domain. In addition, for those counting bytes, relative links are shorter.
However, sometimes, absolute links cannot be avoided:
- Redirects to another page used to require an absolute URL in the Location header. (Relative URLs are now supported.)
- Links in emails you send need to be absolute, because an email has no base URL to resolve them against.
- No doubt there are other reasons I can’t think of right now.
The question is: where does the domain name etc. come from to build those absolute links?
There are two approaches available:
- Extract the domain name from the incoming request
- Configure the application to know its domain name (e.g. environment variable)
Instinctively I always tend to go for option 1. I regret it every time. Do not go for option 1.
Option 1 certainly seems more elegant, doesn’t it? However, it has the following problems:
Firstly, your application will probably sit behind a load balancer or reverse proxy. The user will connect to the load balancer, and the load balancer will connect to your application. Therefore, you can’t rely on headers like Host:, defined in the HTTP standard, to determine which server the user attempted to connect to.
The load balancer does write that information into additional HTTP headers; however, there is no standard for the names of those headers. Apache does it differently from the AWS load balancer, for example. As evidence of this, neither Apache, Nginx, Jetty nor Tomcat gets this right “out of the box” when using AWS’s load balancer (which is a pretty standard product at this point). A common issue is that although the domain name is right, the protocol (HTTP vs HTTPS) is wrong.
(Technically it’s not true that there is no standard for these headers. The trouble is that there is more than one standard. I suspect there are actually n standards, where n is the number of load balancer products in existence.)
Another issue is that even if you do manage to parse these headers for your load balancer, the application will no longer generate URLs correctly when tested locally. You need to account for the fact that there might be no load balancer, as well.
What if you need to deploy the application to a customer’s infrastructure? What sort of load balancer do they use? If it’s a large company, the department managing firewalls etc. might not even be the one you’re in contact with.
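To make the first problem concrete, here is a minimal Python sketch of what option 1 tends to turn into. The function name base_url_from_headers is my own invention, and the headers it checks (the RFC 7239 Forwarded header, then the de facto X-Forwarded-Proto and X-Forwarded-Host headers) are only assumptions about what a given load balancer might send. Yours may send something else, or nothing at all.

```python
def base_url_from_headers(headers, default_scheme="http"):
    """Guess the public base URL from request headers (hypothetical helper)."""
    # HTTP header names are case-insensitive, so normalise them first.
    h = {name.lower(): value for name, value in headers.items()}

    # Guess 1: the RFC 7239 "Forwarded" header,
    # e.g. 'for=192.0.2.1;proto=https;host=www.site.com'.
    forwarded = h.get("forwarded")
    if forwarded:
        # A chain of proxies appends comma-separated entries; use the first.
        first_hop = forwarded.split(",")[0]
        params = {}
        for part in first_hop.split(";"):
            key, sep, value = part.strip().partition("=")
            if sep:
                params[key.lower()] = value.strip('"')
        if "host" in params:
            return f"{params.get('proto', default_scheme)}://{params['host']}"

    # Guess 2: the de facto X-Forwarded-* headers (assumed, not guaranteed).
    scheme = h.get("x-forwarded-proto", default_scheme).split(",")[0].strip()

    # Guess 3: no proxy information at all (e.g. local testing), so use Host.
    host = h.get("x-forwarded-host") or h.get("host")
    if host is None:
        raise ValueError("no usable host header in request")
    return f"{scheme}://{host.split(',')[0].strip()}"
```

Even this sketch has to guess. It trusts whichever header it finds first, and when nothing mentions the protocol it falls back to plain http, which is exactly how the HTTP vs HTTPS mistake creeps in.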
Secondly, you might want to write a nightly batch job that sends emails like “since you last logged in, you have received 3 new messages, click here to read them”. These emails are sent out by a batch job, and not in response to a user’s request. Therefore, even if you had all the HTTP header stuff sorted out, you would still need a way to configure the absolute URL (e.g. an environment variable).
My conclusion, and recommendation, is: if you need a config item (e.g. environment variable) to specify the absolute URL in some cases (nightly emails) anyway, you might as well sidestep the effort and fragility of parsing those HTTP headers, and always use the config item.
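For comparison, a minimal Python sketch of the config-item approach might look like the following. The variable name APP_BASE_URL, the /messages path, and both function names are assumptions of mine for illustration, not part of any framework. The point is that the same helper works in a request handler and in a nightly batch job, because it never looks at a request.

```python
import os
from email.message import EmailMessage

BASE_URL_VAR = "APP_BASE_URL"  # hypothetical name for the config item


def absolute_url(path, environ=os.environ):
    """Build an absolute URL from the configured base URL."""
    base = environ.get(BASE_URL_VAR)
    if not base:
        # Fail loudly at the first use rather than send out broken links.
        raise RuntimeError(f"{BASE_URL_VAR} is not set")
    # Join with exactly one slash, whatever the config value ends with.
    return base.rstrip("/") + "/" + path.lstrip("/")


def new_messages_email(recipient, count, environ=os.environ):
    """Compose the nightly notification; no HTTP request is involved."""
    message = EmailMessage()
    message["To"] = recipient
    message["Subject"] = f"You have {count} new messages"
    message.set_content(
        f"Since you last logged in, you have received {count} new messages.\n"
        f"Read them here: {absolute_url('/messages', environ)}\n"
    )
    return message
```

Running locally then only means setting APP_BASE_URL to something like http://localhost:8000, and deploying to a customer’s infrastructure means setting it to whatever public URL they use, regardless of which load balancer sits in front.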