Measurement and Analysis of
Private Key Sharing in the HTTPS Ecosystem

Frank Cangialosi§, Taejoong Chung*, David Choffnes*,
Dave Levin, Bruce M. Maggs, Alan Mislove*, Christo Wilson*

§MIT CSAIL, *Northeastern University, University of Maryland, Duke University and Akamai Technologies

Paper Overview

Abstract

The semantics of online authentication in the web are rather straightforward: if Alice has a certificate binding Bob’s name to a public key, and if a remote entity can prove knowledge of Bob’s private key, then (barring key compromise) that remote entity must be Bob. However, in reality, many websites—and the majority of the most popular ones—are hosted at least in part by third parties such as Content Delivery Networks (CDNs) or web hosting providers. Put simply: administrators of websites who deal with (extremely) sensitive user data are giving their private keys to third parties. Importantly, this sharing of keys is undetectable by most users, and widely unknown even among researchers.

In this paper, we perform a large-scale measurement study of key sharing in today’s web. We analyze the prevalence with which websites trust third-party hosting providers with their secret keys, as well as the impact that this trust has on responsible key management practices, such as revocation. Our results reveal that key sharing is extremely common, with a small handful of hosting providers having keys from the majority of the most popular websites. We also find that hosting providers often manage their customers’ keys, and that they tend to react more slowly yet more thoroughly to compromised or potentially compromised keys.

What do we mean by "key sharing"?

In general, we are referring to the scenario where one party makes its certificate's private key available to another party. Since this is difficult to observe as an outsider, we restrict our definition of key sharing to what we can observe:

We say that key sharing has taken place if any of the parties named in a certificate (either in the Common Name or in the SAN list) is not the same entity as the organization that owns the IP address from which the certificate is advertised.

Why study key sharing?

The security of any public key encryption system rests on keeping private keys private; sharing private keys across entities violates this assumption. A single website choosing to share its private key with a hosting provider may seem relatively innocuous, but large numbers of websites sharing with a small number of hosting providers may lead to even greater centralization of trust than was previously realized. Our results expose trust relationships in the HTTPS ecosystem, complementing a large body of work (see §7 in our paper) that has studied similar trust relationships between websites and CAs.


For more information, check out our paper or slides from CCS'16.

Primary Datasets

  1. SSL Certificates: We used SSL certificates from 74 (roughly) weekly scans of port 443 over the entire IPv4 address space, collected by Rapid7 between October 30, 2013 and March 30, 2015. We observed 38,514,130 unique SSL certificates, of which 5,067,476 were valid leaf certificates. These certificates contain 2,552,936 unique domains (including domains in the SAN lists). A consolidated list of these certificates and their key details is available below. For more detailed info, see the original webpage.
  2. Reverse DNS: We used full IPv4 reverse DNS scans conducted by Rapid7 to look up the entity controlling each IP address that we observed advertising a certificate in (1). Unfortunately, the DNS standard does not require address owners to provide reverse DNS entries. This data is included as a field in the leaf certificates dataset.
  3. AS Number and Organization: In cases where reverse DNS was not available, we used daily snapshots of CAIDA's RouteViews datasets, which map IP addresses to ASNs, and then aggregated ASes owned by the same organization using CAIDA's AS-to-Organization dataset.
  4. WHOIS: We combine WHOIS data from two sources covering 2,197,292 (86.0%) of the 2.5M domains from (1). The domains for which we were unable to find WHOIS data were typically those that had expired or whose registrars did not publish WHOIS data. Unfortunately, the terms of service of these sources do not allow us to share the full WHOIS records we obtained. However, we are able to provide the list of email addresses appearing in the WHOIS record of each domain, which is the only information we actually used from the records.
The certificates in (1) form the basis of our study. The reverse DNS and AS organization information from (2) and (3) provide insight into the organizations that advertise these certificates, and the WHOIS datasets from (4) reveal information about the domains present in the certificates themselves.
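The lookup order described in (2) and (3), reverse DNS first and AS organization as the fallback, can be sketched as follows. This is a minimal illustration, not the scripts used in the study: the function names, the in-memory dictionaries, and the naive "last two labels" reduction of a hostname are all assumptions made for the example, and the addresses come from documentation ranges.

```python
import ipaddress
from typing import Mapping, Optional


def registered_domain(hostname: str) -> str:
    """Naively keep the last two labels of a hostname.

    Assumption for illustration only: this ignores multi-label public
    suffixes such as co.uk.
    """
    labels = hostname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def lookup_asn(ip: str, prefix_to_asn: Mapping[str, int]) -> Optional[int]:
    """Longest-prefix match of an IP against a (hypothetical) prefix -> ASN table."""
    addr = ipaddress.ip_address(ip)
    best = None  # (prefix length, ASN) of the most specific match so far
    for prefix, asn in prefix_to_asn.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, asn)
    return best[1] if best else None


def owner_of_ip(
    ip: str,
    rdns: Mapping[str, str],
    prefix_to_asn: Mapping[str, int],
    asn_to_org: Mapping[int, str],
) -> Optional[str]:
    """Prefer the reverse DNS name; fall back to the AS organization."""
    hostname = rdns.get(ip)
    if hostname:
        return registered_domain(hostname)
    asn = lookup_asn(ip, prefix_to_asn)
    if asn is None:
        return None
    return asn_to_org.get(asn)


# Example data (all values are made up).
rdns = {"192.0.2.10": "server1.hosting.example.net."}
prefix_to_asn = {"198.51.0.0/16": 64501, "198.51.100.0/24": 64500}
asn_to_org = {64500: "Example Hosting Inc."}

print(owner_of_ip("192.0.2.10", rdns, prefix_to_asn, asn_to_org))
print(owner_of_ip("198.51.100.7", rdns, prefix_to_asn, asn_to_org))
```

The linear scan over prefixes keeps the sketch short; for a full routing table a radix tree or sorted-prefix structure would be the more practical choice.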

Name | Type | Size | Format | SHA-256 Hash (Compressed) | Labels
Leaf Certificates | gzipped tsv (tab-separated values) | 1.1 GB | README | dc025023fa1fde39c98ce928e84a5158c49949eeed3bc6c6ffb044ba71eb25f4 | CERTS, RDNS
IP to ASN | gzipped directory | 292 MB | README | 3e9931bc0e6b09efd32abc1e2e362bef3febd65622d87a368088a71e39133fb7 | ASN
ASN to Organization | gzipped directory | 35 MB | README | 984ed95082cf4a8b3dd114f589e698bc04fbab9a55fdfd727371cc709ca69dd7 | ASN
WHOIS Record Emails | gzipped ssv (space-separated values) | 68 MB | README | 1efc7fbce69b7e01ac2f2b756e9c0ef3ba5fa0bc1dd49b341b2ec5bb3637c524 | WHOIS
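Since each download is published with the SHA-256 hash of its compressed form, a downloaded archive can be checked before use. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and any file name passed to them is whatever name the archive was saved under locally.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, published_hex: str) -> bool:
    """Compare the file's digest with a published hex string, ignoring case and whitespace."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

The hash should be computed over the compressed file as downloaded, not over its decompressed contents.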

Processed Datasets

Determining who owns a domain CERTS WHOIS

The first dataset groups all of the domains in our dataset into separate organizational entities. This is an important tool in our study because it allows us to identify cruise-liner certificates, which list domains from many different organizations (as opposed to certificates with many domain names from a single organization, as with Google). Also, reporting on how many organizations share their keys avoids inflating the numbers: a single organization’s decision to use a third-party hosting provider could result in all of its domains’ keys being shared, and some organizations own hundreds of domains.

As outlined in the figure above, we first created a graph linking all domains from our certificate dataset to the email addresses appearing in their WHOIS records, and then used the Louvain community detection algorithm to cluster these domains into groups of organizations. For more details, please see Section 4.1 in our paper.
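To make the graph construction concrete, here is a small sketch that links domains to the email addresses in their WHOIS records and groups domains that are connected through shared addresses. Note the substitution: it uses plain connected components (via union-find) as a simplified stand-in, not the Louvain algorithm used in the study. Connected components will merge everything that shares even one address, which is presumably the kind of over-merging that community detection is meant to limit. The function name and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def cluster_domains(whois_emails: Mapping[str, Iterable[str]]) -> List[List[str]]:
    """Group domains connected through shared WHOIS email addresses.

    Simplified stand-in for community detection: connected components
    of the bipartite domain/email graph.
    """
    parent: Dict[tuple, tuple] = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Nodes are tagged so a domain and an email can never collide.
    for domain, emails in whois_emails.items():
        find(("domain", domain))
        for email in emails:
            union(("domain", domain), ("email", email))

    groups = defaultdict(set)
    for domain in whois_emails:
        groups[find(("domain", domain))].add(domain)
    return sorted(sorted(group) for group in groups.values())


# Made-up example: a.example and b.example share an address.
whois = {
    "a.example": ["admin@a.example"],
    "b.example": ["admin@a.example", "ops@b.example"],
    "c.example": ["owner@c.example"],
}
print(cluster_domains(whois))
```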


Determining a site's (third-party) hosting providers CERTS ASN RDNS

The next dataset provides the ability to determine which third-party organizations host a given certificate. We first identify all possible hosting providers by looking up each IP address from our certificate dataset in our reverse DNS and ASN datasets. We then unify these for each certificate (e.g. a reverse DNS entry of softlayer.com and an AS Organization Name of Soft-Layer Technologies Inc. represent the same organization and thus should only be counted once). Finally, using the domain ownership methodology from the previous section, we conclude that a certificate is...

  • first-party hosted if: the certificate contains only one unique organizational entity and all of the IP addresses serving that certificate are owned by that same organization.
  • third-party hosted if: there is more than one organization on the certificate (e.g. as is the case with cruise-liner certificates) or if any of the IP addresses from which it is served is owned by an organization not in the certificate.
For more information please see Section 4.2 from our paper.
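The two rules above can be written as a short predicate. This is a sketch under stated assumptions: it presumes that the organizations on the certificate and the organizations owning the serving IP addresses have already been resolved to comparable identifiers, and the function name and the choice to reject an empty organization set are illustrative.

```python
from typing import Iterable


def classify_hosting(cert_orgs: Iterable[str], ip_owner_orgs: Iterable[str]) -> str:
    """Return "first-party" or "third-party" for one certificate.

    cert_orgs: organizations that own the domains named on the certificate.
    ip_owner_orgs: organizations that own the IP addresses serving it.
    """
    cert_orgs = set(cert_orgs)
    ip_owner_orgs = set(ip_owner_orgs)
    if not cert_orgs:
        raise ValueError("certificate has no resolved organization")
    # First-party: exactly one organization on the certificate, and every
    # serving IP address is owned by that same organization.
    if len(cert_orgs) == 1 and ip_owner_orgs <= cert_orgs:
        return "first-party"
    return "third-party"


print(classify_hosting({"OrgA"}, {"OrgA"}))
print(classify_hosting({"OrgA"}, {"OrgA", "SomeCDN"}))
print(classify_hosting({"OrgA", "OrgB"}, {"OrgA"}))
```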

Determining who manages a certificate CERTS

This dataset determines who manages each certificate: the organization(s) on the certificates or the hosting provider serving the certificate? Determining who is revoking or reissuing a certificate is nontrivial: revocations and reissues do not express who exactly requested them (after all, the PKI was designed on the premise that the entity listed on the certificate is the sole owner of the secret key).

Our insight is that hosting providers who manage their customers' certificates are responsible for obtaining many new certificates, and would therefore, out of convenience, likely gravitate towards a small set of certificate authorities when obtaining certificates. More specifically, we anticipate that when the population of users from a given provider (mostly) obtains their own certificates, this distribution will resemble the distribution of CAs across the entire population of certificates. On the other hand, when a hosting provider manages certificates on its customers' behalf, we anticipate the distribution will be skewed very heavily towards a small set of issuing certificates. For more details please see Section 6.1 in our paper.
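One way to picture this comparison is to compute the distribution of issuing CAs for a provider's certificates and measure how far it is from the distribution over all certificates. The sketch below uses total variation distance as the measure; that choice, the function names, and the sample data are assumptions made for illustration, and Section 6.1 of the paper describes the method actually used.

```python
from collections import Counter
from typing import Dict, Iterable


def issuer_distribution(issuers: Iterable[str]) -> Dict[str, float]:
    """Fraction of certificates issued by each CA."""
    counts = Counter(issuers)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no certificates given")
    return {ca: n / total for ca, n in counts.items()}


def total_variation_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """0.0 for identical distributions, 1.0 for distributions with no CA in common."""
    cas = set(p) | set(q)
    return 0.5 * sum(abs(p.get(ca, 0.0) - q.get(ca, 0.0)) for ca in cas)


# Made-up data: four CAs used equally across the whole population.
overall = issuer_distribution(["CA1", "CA2", "CA3", "CA4"])
# Customers obtaining their own certificates: mirrors the population.
self_managed = issuer_distribution(["CA1", "CA2", "CA3", "CA4"])
# Provider obtaining certificates for its customers: skewed to one CA.
provider_managed = issuer_distribution(["CA1", "CA1", "CA1", "CA1"])

print(total_variation_distance(self_managed, overall))
print(total_variation_distance(provider_managed, overall))
```

A distance near zero suggests customers are choosing their own CAs, while a large distance is consistent with the provider obtaining certificates on their behalf.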

Name | Type | Size | Format | SHA-256 Hash (Compressed)
Domain to Organization Mapping | ssv (space-separated values) | 35 MB | README | 926a2ffd5e57cbab32a2d891c31b040532f67848dcaff422e34095f92d1a6120
Third-Party Services Hosting Each Certificate | gzipped ssv (space-separated values) | 138 MB | README | 1f38a7eaef45ab3a8b32fb0d95e248e8769838e5252eda39a386977986e6e80e
Management Policy of Third-Party Hosting Services 2 | gzipped tsv | 3.1 MB | README | eea41858f6c2b7636366acca93502794985b9f59ac690f9d92f53a0e238935fd
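Because several of these files are large, it can be convenient to stream rows directly from the gzipped file rather than decompressing it to disk first. The following is a minimal sketch; the function name is hypothetical, and the column layout of each file is documented in its README, so no field names are assumed here.

```python
import gzip
from typing import Iterator, List


def iter_rows(path: str, delimiter: str = "\t") -> Iterator[List[str]]:
    """Yield one list of fields per line of a gzipped delimiter-separated file.

    Use delimiter="\t" for tsv files and delimiter=" " for ssv files.
    Undecodable bytes are replaced rather than raising an error.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            yield line.rstrip("\n").split(delimiter)
```

Splitting on the raw delimiter avoids any quote handling, on the assumption that fields do not themselves contain the delimiter.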

Analysis and Plots

All of the scripts used to create the processed datasets from the primary datasets, analyze the processed datasets, and produce all of the plots in our CCS'16 paper can be downloaded from GitHub here.

Contact

If you have any questions, comments or concerns, or if you're interested in using our data in your research, please email Frank Cangialosi!