Why do HTTPS requests include the host name in clear text?
I'm having a little bit of trouble understanding why the HTTPS protocol includes the host name in plain text. I have read that the host name and IP addresses of an HTTPS packet are not encrypted.
Why can't the host name be encrypted? Can't we just leave the destination IP in plain text (so the packet is routable), then when the packet arrives at the destination server, the packet is decrypted and the host/index identified from the header?
Maybe the problem is that there can be different certs for one particular destination IP (different certs for different subdomains?), so the destination server cannot decrypt the packet until it arrives at the correct host within that server. Does this make ANY sense, or am I way off?
The encryption algorithms need to be set up before encryption can take place, which is why everything is initially in the clear. The domain name is generally part of the certificate the server sends in the clear. Any domain or IP that is associated with the server's certificate is going to be available anyway. It's part of identifying the server. HTTPS doesn't provide anonymity, just security.
After the handshake takes place and the encryption algorithms are selected, why wouldn't everything be encrypted other than the destination IP?
To clarify: I used to think that the HTTP `Host` header was somehow left visible when HTTPS is used. That's not the case. All HTTP headers, query params, body, etc are encrypted within the TLS connection. Instead, the TLS connection itself begins with a handshake that includes the `server_name` field, which may be necessary for the server to respond with the appropriate certificate, and/or for a shared hosting provider to determine which customer's application the request is intended for, as they share the same IP address.
In almost all cases(1), a DNS query has been done immediately before the first(2) HTTPS connexion giving away the domain name in clear text. (1) Exceptions are when the domain name is defined in a local file (like the `hosts` file), or when a previous DNS query returned a wildcard answer (`*` answer). (2) Subsequent connexions will reuse the cached DNS answer and also may reuse the TLS session.
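To illustrate the point about the preceding DNS lookup, here is a minimal sketch that builds a classic, unencrypted DNS query by hand and shows that the name sits in the packet as readable bytes. The helper name `build_dns_query`, the query ID and the host `www.example.com` are assumptions made for the illustration; nothing is sent over the network.

```python
import struct

def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a plain DNS query asking for the A record of `hostname`."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE = 1 (A record), QCLASS = 1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)

query = build_dns_query("www.example.com")
# The labels of the host name are visible to anyone who can see the packet.
print(b"\x03www\x07example\x03com\x00" in query)
```

Anyone on the path who sees this packet learns the name before the TLS connection is even opened.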
The hostname is included in the initial SSL handshake to support servers which have multiple host names (with different certificates) on the same IP address (SNI: Server Name Indication). This is similar to the Host-header in plain HTTP requests. The name is included in the first message from the client (ClientHello), that is before any identification and key exchange is done, so that the server can offer the correct certificate for identification.
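As a rough illustration of the ClientHello point, the sketch below uses Python's standard `ssl` module with in-memory buffers to produce the first message a client would send, then checks that the host name is readable in it. The helper name `client_hello_bytes` is made up for this example, and no network connection is opened.

```python
import ssl

def client_hello_bytes(hostname: str) -> bytes:
    """Return the raw bytes a TLS client would send first for `hostname`."""
    ctx = ssl.create_default_context()
    incoming = ssl.MemoryBIO()   # bytes from the (absent) server
    outgoing = ssl.MemoryBIO()   # bytes the client wants to send
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=hostname)
    try:
        tls.do_handshake()       # cannot finish: no server ever answers
    except ssl.SSLWantReadError:
        pass                     # the ClientHello is now waiting in `outgoing`
    return outgoing.read()

hello = client_hello_bytes("honest-bank-inc.com")
print(hello[0] == 0x16)                  # first byte marks a TLS handshake record
print(b"honest-bank-inc.com" in hello)   # the SNI host name is sent as plain bytes
```

No key has been agreed at this point, which is why the name cannot simply be encrypted with the session key.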
While encrypting the hostname would be nice, the question would be which key to use for encryption. The key exchange comes only after identification of the site by certificate, because otherwise you might exchange keys with a man-in-the-middle. But identification with certificates already needs the hostname so that the server can offer the matching certificate. So the encryption of the hostname would need to be done with a key either based on some other kind of identification or in a way not safe against man-in-the-middle.
There could be ways to protect the hostname in the SSL handshake, but at the cost of additional overhead in handshake and infrastructure. There is ongoing discussion about whether and how to include encrypted SNI in TLS 1.3. I suggest you have a look at this presentation and the IETF TLS mailing list.
Apart from that, the hostname can also leak by other means, such as the preceding DNS lookup for the name. And of course the certificate sent in the server's response is not encrypted either (same problem, no key yet), so one can extract the requested target from the server's response.
There are lots of sites out there which will not work without SNI, like all of Cloudflare's free SSL offering. If accessed by a client that does not support SNI (like IE8 on Windows XP), this will result in either the wrong certificate being served or some SSL handshake error like 'unrecognized_name'.
I see some parallels between this and the "https everywhere" Google push. While it would be great to encrypt everything and prevent all MitM possibilities (even for streaming sites), how will we deal with all of caching and bandwidth savings that encrypting will negate?
@JoshvonSchaumburg I'd say the argument is that the infrastructure is now mainly used by stuff that doesn't cache anyway (e.g. watching youtube videos). After all, if it did work, why would companies bother with their own CDNs? The ISPs and infrastructure maintainers have no motivation to do caching well anymore, because they're paid by the byte ;)
I'm not sure about this explanation. For one, ISPs DO have motivation to cache well. Most all of them (in the US at least and I assume elsewhere) offer unlimited per flat monthly fee. This means they absolutely have incentive to cache. Additionally, even if they did offer data driven plans (let's take wireless carriers, for example, or the test markets where ISPs are capping wired broadband), service providers still have incentive to cache within their networks because then speeds are greater. Also, they do not need to peer with other networks to pull the video streams with caching servers.
@JoshvonSchaumburg Since an attacker would know which site he wants to attack and would gather information about this site by legal means (Google, opening the site in his browser and alike), he would get the information you are desperately trying to hide at the click of a button, namely the mapping of a hostname to its IP. There are attacks which use plaintext injection - something which wouldn't be needed if the hostname was encrypted. So encrypting the hostname would generate a vulnerability for securing information which can easily be acquired by other means.
@MarkusWMahlberg: the goal of encrypting SNI would not be to hide the mapping from hostname to IP, but to hide the actual hostname you access on systems which have lots of hostnames per IP (typical hosting services, CDN...). And I don't see your argument that encrypting SNI would create new vulnerabilities, I only see that it would be either considerable overhead or simply ineffective.
@SteffenUllrich some ciphers used have weaknesses against known plaintext attacks. Since every request would contain some known plaintext, this would impose a theoretical weakness. I am not sure whether hiding which hostname one accesses would actually add security other than plausible deniability.
@MarkusWMahlberg - I don't think that posing a question for the sake of better understanding a protocol amounts to my "desperately trying to hide" something. Feel free to leave off the sarcastic and condescending tone in the future!
@JoshvonSchaumburg My comment was intended to be neither sarcastic nor condescending, and I would like to express my utmost regret for any offended feelings. It was a mere intellectual play, and please be so kind as to see my comment as part of an academic dispute, as intended.
Your implication that I was "desperately" trying to hide information that could be ascertained by the "click of a button" was, in my opinion, condescending towards my limited knowledge on the subject, but I will accept your apology.
@MarkusWMahlberg: I don't think that your argument about known plain text is relevant. Each HTTP request/response inside a TLS connection (i.e. HTTPS) contains more known plain text than the SNI name.
@Luaan Because caching allows them to charge for bytes they're not actually transferring, of course. And YouTube videos could be cached, if Google were not pushing HTTPS everywhere.
@immibis Well, that's not really specific to Google; I wouldn't be surprised if most of website traffic over the internet was HTTPS nowadays. There's even ISPs that try to force you to install their own root CAs!
If the hostname was transmitted after DHKE but before SSL, then it could be protected against passive snooping.
@Bengie: At least in TLS versions earlier than TLS 1.3, the key exchange is started only after receiving the certificate. But the choice of certificate requires knowledge of the expected hostname on systems with multiple certificates on the same IP address. This means sending the name after DHKE is too late, since it needs to be known earlier to choose the appropriate certificate.
SNI is there for virtual hosting (several servers, with distinct names, on the same IP address). When an SSL client connects to an SSL server, it wants to know whether it is talking to the right server. To do that, it looks for the name of the intended server in the certificate. Every evil hacker can buy a certificate for his own server (called `evilhacker.com`), but he won't be able to use it for a fake server posing as `honest-bank-inc.com` because the client browser, trying to connect to `https://honest-bank-inc.com/`, will be quite adamant at finding in the purported server certificate the string `honest-bank-inc.com`, and certainly not `evilhacker.com`.
So far so good. Then comes the technical part. In "HTTPS", you have HTTP encapsulated in SSL. This means that the SSL tunnel is first established (the "handshake" procedure) and then, only then, does the client send the HTTP request within that tunnel. During the handshake, the server must send its certificate to the client. But, at that point, the client has not yet told the server who it is trying to talk to. The server may assume that, since it received the connection, the name that the client wants to reach is probably one of the site names that the server hosts. But that's guesswork and fails if the server hosts several sites with distinct names.
There are mostly three ways out of this conundrum:
One IP per site. Since the server knows the target IP address right from the beginning of the connection, the server can use that IP address to know which site is the true target. Of course, the relative shortage of IP addresses makes that traditional solution less attractive nowadays.
Several names in the server certificate. This is perfectly valid. Google themselves tend to put more than 70 names in their certificate. A variant is "wildcard names" that contain '*', matching many names. However, this works only as long as all names are known when the certificate is issued. For a Web site hosting service, this would mean buying a new certificate whenever a customer registers.
SNI. With SNI, the client sends the intended name early in the handshake, before the server sends its certificate. This is the modern solution, and now that Windows XP has gone beyond the brink, it can finally be widely used (IE on Windows XP was the last browser that did not support SNI).
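The interplay between the last two options can be sketched with a small pure-Python model. This is not real TLS code: the certificate labels, the `CERTIFICATES` table, the default choice and the simplified wildcard rule are all assumptions made for the illustration, and real clients apply stricter name-matching rules.

```python
from typing import Optional

# Hypothetical certificate store: label -> names the certificate covers.
CERTIFICATES = {
    "fred-cert": ["www.fred.com"],
    "barney-cert": ["barney.com", "*.barney.com"],
}
DEFAULT_CERT = "fred-cert"  # what the server falls back to without a hint

def name_matches(pattern: str, hostname: str) -> bool:
    """Simplified check of one certificate name against a host name."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if not pattern.startswith("*."):
        return pattern == hostname
    # In this model a wildcard covers exactly one left-most label.
    head, _, rest = hostname.partition(".")
    return bool(head) and bool(rest) and rest == pattern[2:]

def choose_certificate(sni: Optional[str]) -> str:
    """Server side: pick a certificate using the SNI name, if one was sent."""
    if sni is not None:
        for label, names in CERTIFICATES.items():
            if any(name_matches(p, sni) for p in names):
                return label
    return DEFAULT_CERT

def client_accepts(cert_label: str, wanted_host: str) -> bool:
    """Client side: the wanted name must appear in the certificate."""
    return any(name_matches(p, wanted_host) for p in CERTIFICATES[cert_label])

# With SNI the server picks the matching certificate and the client is happy.
print(client_accepts(choose_certificate("www.barney.com"), "www.barney.com"))
# Without SNI the server can only guess, and the client sees a name mismatch.
print(client_accepts(choose_certificate(None), "www.barney.com"))
```

The second call models a client without SNI support: the server sends its default certificate, which does not list the name the client asked for.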
Good explanation. So if I were still using IE on XP and navigated to an HTTPS site on a server hosting multiple HTTPS sites with multiple certs, how would the server know which cert to provide without SNI support?
@JoshvonSchaumburg: The server does not know and would either provide the default certificate of the site (which will result in rejection by the browser because of name-mismatch) or return an "unrecognized_name" alert or something like this.
The fact that you are communicating with some server obviously cannot be hidden at the IP level. The packets have to leave your machine, enter the network, be routed to the destination, and be delivered. It's not a secret that you're contacting a server that delivers pages from https://www.fred.com.
However, the URL does not contain the IP. Instead, it contains more than one piece of information. Not only does it contain the host name (which has to be resolved to an IP address), it contains details about the specific request: route it to a server at that address named www, search, mail, etc, and along this directory path name.
Hosting services have exploited this indirection to support multiple sites on a single web server via Server Name Indication. So both https://www.fred.com and https://www.barney.com might be hosted on a server at 127.0.0.1, and it's only the name that distinguishes how the server will route that message internally. It's possible that for security reasons it will need to be routed to a separate server, where the actual keys will be stored. The keys to decrypt that message may not exist on that front-end server, so it can't get decrypted until it arrives at the machine hosting the fred site.
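A front-end that holds no private keys can still make this routing decision, because the name travels unencrypted in the ClientHello. The sketch below parses the `server_name` extension out of raw bytes and looks up a back-end. The `BACKENDS` table, its addresses and the helper `minimal_client_hello` (which hand-builds a stripped-down ClientHello for the demo) are hypothetical; a production parser would need more validation.

```python
import struct
from typing import Optional

# Hypothetical routing table: SNI name -> internal back-end address.
BACKENDS = {"www.fred.com": "10.0.0.11", "www.barney.com": "10.0.0.12"}

def extract_sni(data: bytes) -> Optional[str]:
    """Read server_name from a raw TLS ClientHello record; no keys needed."""
    try:
        if data[0] != 0x16 or data[5] != 0x01:
            return None                    # not a handshake record / ClientHello
        pos = 5 + 4                        # record header + handshake header
        pos += 2 + 32                      # client version + random
        pos += 1 + data[pos]               # session id
        (suites_len,) = struct.unpack_from("!H", data, pos)
        pos += 2 + suites_len              # cipher suites
        pos += 1 + data[pos]               # compression methods
        (ext_total,) = struct.unpack_from("!H", data, pos)
        pos += 2
        end = pos + ext_total
        while pos + 4 <= end:
            ext_type, ext_len = struct.unpack_from("!HH", data, pos)
            pos += 4
            if ext_type == 0:              # server_name extension
                # Layout: list length (2), name type (1), name length (2), name.
                (name_len,) = struct.unpack_from("!H", data, pos + 3)
                if data[pos + 2] == 0:     # name type 0 = host_name
                    return data[pos + 5 : pos + 5 + name_len].decode("ascii")
            pos += ext_len
    except (IndexError, struct.error, UnicodeDecodeError):
        pass                               # truncated or malformed input
    return None

def route(client_hello: bytes) -> Optional[str]:
    """Pick a back-end from the SNI name alone, without decrypting anything."""
    return BACKENDS.get(extract_sni(client_hello) or "")

def minimal_client_hello(hostname: str) -> bytes:
    """Hand-build a stripped-down ClientHello carrying only an SNI extension."""
    name = hostname.encode("ascii")
    entry = b"\x00" + struct.pack("!H", len(name)) + name
    ext_data = struct.pack("!H", len(entry)) + entry
    ext = struct.pack("!HH", 0, len(ext_data)) + ext_data
    body = (b"\x03\x03" + bytes(32) + b"\x00"          # version, random, no session id
            + struct.pack("!H", 2) + b"\x13\x01"       # one cipher suite
            + b"\x01\x00"                              # null compression only
            + struct.pack("!H", len(ext)) + ext)
    handshake = b"\x01" + len(body).to_bytes(3, "big") + body
    return b"\x16\x03\x01" + struct.pack("!H", len(handshake)) + handshake

print(route(minimal_client_hello("www.fred.com")))
```

The front-end would then forward the untouched bytes to the chosen machine, which holds the keys and completes the handshake itself.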
Besides the need to include the host name so that the server can select the right certificate, there is one common practice connected with SNI that is important to know: it allows implementing virtual hosting (https://en.wikipedia.org/wiki/Virtual_hosting) over SSL/TLS.
SNI is used by all popular Web servers: