Several months ago I wrote a post called "Where, oh where, does the API key go?" I encouraged API providers to allow consumers to put the API key in the Authorization header to help avoid accidental disclosure of keys via things like web server logs. I recently bumped into a way for anyone to harvest hundreds of API keys from many different web sites, including ones that charge significant amounts of money for access.
The API Keys I discovered are in HttpArchive. HttpArchive is a project started by Steve Souders as a tool to help make the web faster. All the data collected by HttpArchive is made available via Google's BigQuery project. There is a discussion site where there are all kinds of conversations about queries that are being run on HttpArchive data and their performance impacts.
When I first heard about the HttpArchive I naively assumed that the data was being collected from the logs of some big piece of internet infrastructure. I suppose if I had looked more closely at the data being collected I would have realized that the data had to be collected via another method.
The answer to how HttpArchive collects its data lies in another incredible tool, WebPageTest. HttpArchive pulls down a list of URLs from the Alexa Top 1,000,000 web sites and then kicks off a bunch of WebPageTest machines to navigate to those URLs and record all of the requests made when loading the sites.
This query against the HttpArchive is all it takes to pull back more than 800 unique API keys from the most recent dump of data.
SELECT
  method,
  REGEXP_EXTRACT(url, r'([^:]*)') AS scheme,
  REGEXP_EXTRACT(url, r'://([^/]*)') AS host,
  REGEXP_EXTRACT(url, r'apikey=([^&?]*)') AS ApiKey
FROM [httparchive:runs.latest_requests]
WHERE url LIKE '%apikey=%'
GROUP BY 1, 2, 3, 4
ORDER BY 1, 2, 3, 4
Hundreds more can be found with different variations of the parameter name, such as api_key, api-key and ApiKey. Pulling the key from the URL is definitely the easiest. However, HttpArchive also records request header values. With a little more regex-fu, you can start pulling API keys out of headers like X-ApiKey and X-Authorization.
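To make the matching concrete, here is a minimal Python sketch of the same idea applied to a handful of URLs instead of the BigQuery table. It is not the query itself: the function name find_api_keys, the pattern and the example.com URLs are all made up for illustration, and the pattern only covers the parameter-name variations mentioned above.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Parameter-name variants mentioned in the text (apikey, api_key, api-key,
# ApiKey), compared case-insensitively. This pattern is an assumption.
KEY_PARAM = re.compile(r"^api[_-]?key$", re.IGNORECASE)

def find_api_keys(url):
    """Return (scheme, host, key) tuples for each API-key-like query parameter."""
    parts = urlsplit(url)
    return [
        (parts.scheme, parts.netloc, value)
        for name, value in parse_qsl(parts.query)
        if KEY_PARAM.match(name)
    ]

# Hypothetical URLs standing in for rows of the requests table.
sample_urls = [
    "https://api.example.com/v1/items?apikey=abc123&page=2",
    "http://maps.example.org/tiles?size=256&api_key=def456",
    "https://cdn.example.net/app.js",
]
found = [hit for url in sample_urls for hit in find_api_keys(url)]
for scheme, host, key in found:
    print(scheme, host, key)
```

Parsing the query string rather than pattern-matching the whole URL avoids false hits on parameters whose names merely end in "apikey".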
Unfortunately, you can also access credentials included in the Authorization header. This was the one header that I was really hoping would have been filtered out of the test results. I have posted to the HttpArchive mailing list with the hope that future dumps of data can get the Authorization header value stripped out. This is the advantage of using a standard header. We know what it is called, we know that the information contained in it should not be shared and we can get no useful performance information from it, so we will not lose anything by removing it from the archives.
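As a rough illustration of how simple that stripping could be, here is a Python sketch that blanks credential headers in a recorded request before it is archived. I don't know how the HttpArchive pipeline is actually built, so treat this purely as an assumption: the header list uses the name/value shape found in HAR files, and the function name, the header set and the "[removed]" placeholder are hypothetical.

```python
# Standard credential-carrying request headers, lower-cased for comparison.
SENSITIVE_HEADERS = {"authorization", "proxy-authorization"}

def strip_credentials(headers):
    """Return a copy of a HAR-style header list with credential values removed."""
    return [
        {"name": h["name"], "value": "[removed]"}
        if h["name"].lower() in SENSITIVE_HEADERS
        else dict(h)
        for h in headers
    ]

# Hypothetical recorded request headers.
recorded = [
    {"name": "Accept", "value": "application/json"},
    {"name": "Authorization", "value": "Bearer hypothetical-token"},
]
cleaned = strip_credentials(recorded)
```

The header name is kept so you can still see that a request was authenticated; only the value is dropped. Non-standard headers like X-ApiKey can't be caught by a fixed list like this, which is exactly the problem with inventing your own header.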
The biggest surprise to me was the fact that we also get API keys from HTTPS requests. WebPageTest runs on the client machine and can see each request in the browser as it is being made, before SSL encryption is applied. All the query parameters and HTTP headers are completely accessible to store.
If you can't afford to have someone misusing your API Key, then don't send it down to the client. HTTPS is not going to save you. And don't rely on security by obscurity. The world of big data is making it easier to expose and query massive amounts of data every day.
And finally, use the Authorization header for what it was intended and don't ever log it!
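For what it's worth, here is a small Python sketch of both halves of that advice: sending the key in the Authorization header instead of the URL, and writing a log line that records header names but never their values. The URL, the key and the helper names are invented, and the "Bearer" scheme is only an assumption; use whatever scheme the API you are calling documents.

```python
import urllib.request

def build_request(url, api_key):
    """Attach the key as a header so it never appears in the URL."""
    # "Bearer" is an assumed scheme, not something every API accepts.
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )

def log_line(request):
    """Describe a request for logging, listing header names but no values."""
    names = ", ".join(sorted(name for name, _ in request.header_items()))
    return f"{request.get_method()} {request.full_url} headers=[{names}]"

# Hypothetical endpoint and key; the request is built but never sent.
req = build_request("https://api.example.com/v1/items", "hypothetical-key")
print(log_line(req))
```

Note that this only keeps the key out of URLs and logs. It does nothing about the earlier point: a key sent to the client is still visible on the client.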