GitHub Secret Leakage Measurement

Master's research project focused on measuring the prevalence of credential leakage in public GitHub repositories.

How Bad Can It Git?
Characterizing Secret Leakage in Public GitHub Repositories

GitHub has been a fantastic tool for the development of free and open-source software. However, a major security risk arises when developers leak authentication secrets in their public repositories, often inadvertently. Our research team set out to present the first large-scale characterization of secret leakage on GitHub. We hope that our work will raise awareness of the issue and encourage platforms to implement better security measures.


Overview

The GitHub Secret Leakage Measurement is a research project that I came up with during a class on software security at NC State University. After I presented the initial prototype in class, my professor, Dr. Bradley Reaves, approached me and asked me to continue developing it as a research project. Along with Dr. Reaves, I worked with my partner Matt McNiece to take the project further.

Abstract

Contributions

Our work makes the following contributions:

Solution

We built a multi-stage pipeline for scanning for and analyzing potential secrets from GitHub. Our infrastructure was composed of Python software and a MongoDB instance.

Methodology

Our secret collection methodology involves several phases that together identify secrets with high confidence.

High-level summaries of each stage of our pipeline are below:

  1. We began by surveying a wide set of common APIs whose compromise would have a high impact. From this survey, we found that many APIs issue secrets with a distinctive, identifiable format. For example, all Amazon AWS Access Key ID values start with the string AKIA.
  2. Next, we scanned multiple resources for these secrets.
    1. Our primary resource was GitHub's Search API, which is used for searching code on the platform. We found that this API allowed near real-time searches of recent commits published to GitHub, and we crafted specialized queries that targeted the distinctive secret formats identified in the first step. This resource essentially gave us live search results from actively developed repositories, so the API keys it surfaced were more likely to still be valid.
    2. Our second resource was a Google BigQuery snapshot of GitHub repositories with open-source licenses. This was a queryable weekly snapshot that we were able to search with a limited regular expression feature set. It provided a more flexible and powerful search approach than the Search API. However, since the resource was a snapshot, the repositories were not guaranteed to be actively developed, and so the API keys were less likely to be valid.
  3. The scanning step provided us with a large dataset of millions of potential secrets. Due to the limitations of the search functionality, some of the detected secrets may have been false positives. Therefore, we further scanned these secrets offline using regular expressions to ensure higher accuracy and extract the secrets themselves. We called the successful outputs of this phase "candidate secrets".
  4. After obtaining our filtered list of candidate secrets, we now wanted to ensure that the secrets were "valid". For ethical reasons, we would not attempt to use the secrets, and so we built a set of validity filters that would give us high confidence of validity. As an example, we built an entropy filter that would ensure that the secrets exhibited a high degree of randomness.
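Stages 3 and 4 above can be illustrated with a minimal Python sketch. This is not the project's actual code: the function names, the assumption that an Access Key ID is the AKIA prefix followed by 16 uppercase letters or digits, and the entropy threshold of 3.0 bits per character are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter

# Assumed format: the "AKIA" prefix mentioned above, followed by
# 16 uppercase letters or digits (the suffix length is an assumption).
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def extract_candidates(text):
    """Return every substring of `text` that matches the key-ID pattern."""
    return AWS_KEY_ID.findall(text)


def shannon_entropy(s):
    """Shannon entropy of `s`, in bits per character."""
    if not s:
        return 0.0
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def passes_entropy_filter(secret, threshold=3.0):
    """Keep a candidate only if its characters look sufficiently random.

    The threshold is an illustrative value, not one taken from the paper.
    """
    return shannon_entropy(secret) >= threshold


if __name__ == "__main__":
    # Placeholder-style strings, not real credentials.
    sample = 'key_a = "AKIAIOSFODNN7EXAMPLE"\nkey_b = "AKIAAAAAAAAAAAAAAAAA"'
    for candidate in extract_candidates(sample):
        print(candidate, passes_entropy_filter(candidate))
```

Both strings in the sample match the regular expression, but only the first passes the entropy check. This shows why a pattern match alone is not enough to treat a string as a likely secret.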

Once we had our final set of valid secrets, we were able to perform data analysis.

Analysis

Below, I have highlighted selected findings from our analysis:

Short-term monitoring of secrets

Many secrets are removed in the first few hours after being committed, but the majority remain.

Long-term monitoring of secrets

Secrets that are still on GitHub a day after being committed tend to stay there indefinitely.

Publication

I was the primary author of our paper, "How Bad Can It Git? Characterizing Secret Leakage in Public GitHub Repositories". Our paper was accepted for publication at the Network and Distributed System Security (NDSS) Symposium, one of the top-tier academic security conferences. I presented our work at the NDSS Symposium in San Diego, California in February 2019.

Michael presenting at NDSS 2019


Impacts

One of the goals of our paper was to draw attention to the problem and to pressure platforms such as GitHub to implement measures that prevent secret leakage. I was very pleased to see that, after the publication of our paper, GitHub announced a new feature that scans repositories for secrets. In 2024, GitHub also enabled secret scanning on push by default. This is a fantastic feature, and I am very glad to know that my work has led to such important changes.