1.0.3 • Published 4 months ago

gh-org-scan v1.0.3

License
Apache-2.0

Watchtower

This repo is a tool for scanning GitHub organizations to gather large amounts of data about the repos in the org. It first gathers info about the repos in an org, then retrieves branch info for every branch of every repo. The script then downloads each branch as a zip file to memory and runs certain rules on it to parse information from files on that branch. Reports are then run on the information obtained from parsing the files.

Definitions

Rules

A rule is a function that gathers data. The data can be gathered from an API call, by scanning the downloaded zip file of a branch, or by other means. This data is then written to a JSON file to act as a cache.

There are three types of rules:

  • Branch Rules: Gather data on a branch
  • Secondary Branch Rules: Run after Branch Rules and rely on the data those rules gather (for example, we cannot tell whether a branch is deployed through GH Actions until the dotGithubRule has parsed the GHA files, so that check is a secondary rule)
  • Org Rules: These rules make a single API call to the whole org, then map the data to a repo. This saves us hundreds of API calls.
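As a rough illustration of the Org Rule idea, the sketch below groups the results of one org-wide call by repo so that no per-repo API call is needed. The `OrgAlert` shape and the `mapAlertsToRepos` name are hypothetical and are not taken from this repo's code.

```typescript
// Hypothetical shape of one item returned by a single org-wide API call.
interface OrgAlert {
  repoName: string;
  severity: "critical" | "high" | "medium" | "low";
}

// Group org-wide results by repo name, so each repo can look up its own
// alerts without triggering another API request.
function mapAlertsToRepos(alerts: OrgAlert[]): Map<string, OrgAlert[]> {
  const byRepo = new Map<string, OrgAlert[]>();
  for (const alert of alerts) {
    const existing = byRepo.get(alert.repoName) ?? [];
    existing.push(alert);
    byRepo.set(alert.repoName, existing);
  }
  return byRepo;
}
```

Repos that do not appear in the org-wide response simply have no entry in the map, which a caller can treat as "no alerts".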

Stale Branches

A stale branch is a branch that is not a default branch, a protected branch, or a deployed branch, and that has not been committed to recently. Default and protected branches are attributes set in a repo's settings. This script decides that a branch is "deployed" if the branch is listed in a GHA file called "deploy.yml" on the default branch. A branch is otherwise stale if it has not had a new commit in 30 days, although this 30-day value can be changed using the STALE_DAYS_THRESHOLD environment variable.
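The stale-branch definition can be sketched as a small predicate. This is a minimal sketch, not the script's actual code: the `BranchInfo` shape and the `isStaleBranch` name are hypothetical, and treating a commit exactly at the threshold as stale is an assumption.

```typescript
// Hypothetical summary of what the rules have gathered about one branch.
interface BranchInfo {
  isDefault: boolean;
  isProtected: boolean;
  isDeployed: boolean; // e.g. listed in deploy.yml on the default branch
  lastCommitDate: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns true when the branch is not default, protected, or deployed and
// its last commit is at least `staleDaysThreshold` days old.
function isStaleBranch(
  branch: BranchInfo,
  staleDaysThreshold: number = 30,
  now: Date = new Date(),
): boolean {
  if (branch.isDefault || branch.isProtected || branch.isDeployed) {
    return false;
  }
  const ageInDays = (now.getTime() - branch.lastCommitDate.getTime()) / MS_PER_DAY;
  return ageInDays >= staleDaysThreshold; // boundary handling is an assumption
}
```

Passing `now` as a parameter keeps the check deterministic, which makes it easier to test than reading the clock inside the function.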

Report Outputs

Some things, like reporting whether a repo is public or internal, can be represented very simply in a single csv file. Other things are more complicated. For example, when reporting on the lowest node version in a repo, which branch or branches should be considered in the report? Even more difficult is how to report on dependency versions when there are thousands of individual library dependencies in an org.

To solve these issues, this script outputs different csv files in different ways:

  • Simple Reports: these reports can be output to a single csv file.
  • Versioning Reports: these reports (like node and terraform version) are contained in a subdirectory that contains four files:
    1. The lowest and highest version on every relevant branch in the org (each row is a branch)
    2. The lowest and highest version on each non-stale branch in the org (each row is a branch)
    3. The lowest and highest version in each repo, considering every branch in the repo (each row is a repo)
    4. The lowest and highest version in each repo, considering only non-stale branches in the repo (each row is a repo)
  • Dependency Reports: These are reports for dependencies that cannot be enumerated (like every npm dependency in the org). They are in a subdirectory with a csv file matching the dependency name. Each row in that csv file corresponds to a branch using that dependency, and in each row we record the version of the dependency found on that branch.
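The per-repo files of a Versioning Report (files 3 and 4 above) could be produced roughly as follows. This sketch assumes one dotted numeric version string per branch, which is a simplification; `BranchVersion`, `compareVersions`, and `versionRangeByRepo` are hypothetical names.

```typescript
// Hypothetical row: the version found on one branch of one repo.
interface BranchVersion {
  repo: string;
  branch: string;
  version: string; // assumed to be dotted numbers, e.g. "18.2.0"
  stale: boolean;
}

interface VersionRange {
  lowest: string;
  highest: string;
}

// Compare dotted numeric versions part by part; missing parts count as 0.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Lowest and highest version per repo, optionally ignoring stale branches.
function versionRangeByRepo(
  rows: BranchVersion[],
  includeStale: boolean,
): Map<string, VersionRange> {
  const result = new Map<string, VersionRange>();
  for (const row of rows) {
    if (!includeStale && row.stale) continue;
    const current = result.get(row.repo);
    if (current === undefined) {
      result.set(row.repo, { lowest: row.version, highest: row.version });
      continue;
    }
    if (compareVersions(row.version, current.lowest) < 0) current.lowest = row.version;
    if (compareVersions(row.version, current.highest) > 0) current.highest = row.version;
  }
  return result;
}
```

Calling `versionRangeByRepo(rows, true)` corresponds to considering every branch, and `versionRangeByRepo(rows, false)` to considering only non-stale branches.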

Overall Health Score

Most reports contribute to an overall health score for each repo. These scores are calculated like a GPA, where each contributing report has a weight and a letter grade associated with it.

Each report calculates its own grade. Reports that do not apply to a repo do not affect the repo's overall score. The three report types generally calculate a grade in the following ways:

  • Simple: simple report grades are calculated by comparing the actual value to an optimal value.

  • Version: we use the lowest version on any branch of a repo and compare it to an optimal version to find a grade.

  • Dependency Reports: these are more difficult to grade. Theoretically we could try to find the most recent version of every dependency, but this could be difficult. Instead, we use a relative grading scheme. We look at the version of a given dependency on a branch of some repo, and then compare it to the highest and lowest versions of that dependency in the org. This comparison gives us a letter grade by finding the spread between the overall highest and lowest versions and grading on a scale across that spread. Unique dependencies used by a single repo are graded as a C. Each of these dependency grades contributes to the overall grade for the dependency environment for a repo, so that each npm dependency would contribute to the overall npm grade.
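The relative grading scheme for dependencies might look like the sketch below. It assumes each version has already been reduced to a single number (for example, the major version), and the equal-width grade bands and the A for a zero spread are assumptions; only the C for single-repo dependencies comes from the description above. `gradeDependency` is a hypothetical name.

```typescript
type Grade = "A" | "B" | "C" | "D" | "F";

// Grade one dependency version relative to the org-wide lowest and highest.
function gradeDependency(
  version: number,
  orgLowest: number,
  orgHighest: number,
  repoCount: number, // number of repos in the org using this dependency
): Grade {
  // A dependency used by a single repo has nothing to compare against.
  if (repoCount <= 1) return "C";

  const spread = orgHighest - orgLowest;
  // Assumption: if every repo is on the same version, treat it as an A.
  if (spread === 0) return "A";

  // Position within the spread: 0 = org lowest, 1 = org highest.
  const position = (version - orgLowest) / spread;

  // Assumption: five equal-width bands across the spread.
  if (position >= 0.8) return "A";
  if (position >= 0.6) return "B";
  if (position >= 0.4) return "C";
  if (position >= 0.2) return "D";
  return "F";
}
```

For example, with an org-wide range of 1 to 5 and several repos using the dependency, a branch on version 5 would get an A and a branch on version 1 would get an F under these assumed bands.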

The overall health score for each repo is written to its own csv file called "overallHealthReport.csv".
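A GPA-style calculation like the one described in this section can be sketched as a weighted average. The 4-point scale and the names `ReportGrade`, `GRADE_POINTS`, and `overallScore` are assumptions for illustration, not the script's actual code.

```typescript
type LetterGrade = "A" | "B" | "C" | "D" | "F";

// Hypothetical result of one report for one repo.
interface ReportGrade {
  report: string;
  weight: number;
  grade: LetterGrade | null; // null when the report does not apply to the repo
}

// Assumption: a conventional 4-point GPA scale.
const GRADE_POINTS: Record<LetterGrade, number> = { A: 4, B: 3, C: 2, D: 1, F: 0 };

// Weighted average of grade points; reports that do not apply are skipped
// so they do not affect the score. Returns null if no report applies.
function overallScore(grades: ReportGrade[]): number | null {
  let weightedPoints = 0;
  let totalWeight = 0;
  for (const { weight, grade } of grades) {
    if (grade === null) continue;
    weightedPoints += GRADE_POINTS[grade] * weight;
    totalWeight += weight;
  }
  return totalWeight === 0 ? null : weightedPoints / totalWeight;
}
```

With an A at weight 5, a C at weight 3, and one non-applicable report, this gives (4 × 5 + 2 × 3) / 8 = 3.25.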

Reports

Reports are functions that aggregate the data gathered by a rule and output it to a csv file. They run very quickly and therefore have no need to be cached.

| Report | Type | Description | Contributes to Overall Health Score Report | Weight |
| --- | --- | --- | --- | --- |
| codeScanAlerts | Simple | Generates several CSV files found in the "CodeScanAlerts" and "CodeScanAlertsCount" subdirectories, giving information about GH Advanced Security code scanning alerts at each level: critical, high, medium, and low. The CodeScanAlertsCount directory just contains the number of alerts at each level for each repo, while the CodeScanAlerts dir contains more detailed info. | Yes | 5 |
| dependabotAlerts | Simple | Generates several CSV files found in the "DependabotAlerts" and "DependabotAlertsCount" subdirectories, giving information about GH Advanced Security dependabot alerts at each level: critical, high, medium, and low. The DependabotAlertsCount directory just contains the number of alerts at each level for each repo, while the DependabotAlerts dir contains more detailed info. | Yes | 5 |
| dependabotBranch | Simple | Generates a csv file called "DependabotBranchReport.csv" with the number of dependabot branches on every repo | Yes | 3 |
| devPrdBranches | Simple | Generates a csv file called "devPrdBranchReport.csv" listing repos without the standard dev/prd branch default and naming scheme | No | N/A |
| DockerfileImage | Dependency | Generates csv files in the "dockerfileImages" subdirectory corresponding to all the images in the org | Yes | 3 |
| ghActionModule | Dependency | Generates csv files in the "GHAModules" subdirectory corresponding to all the modules in the org | Yes | 3 |
| language | Simple | Generates a csv file called "LanguageReport.csv" listing the primary language in each repo | No | N/A |
| lowFiles | Simple | Generates two csv files. The first, "LowFileCountInRepoReport.csv", lists the repos with a low (<5) file count on every branch. The second, "LowFileCountOnBranchReport.csv", lists every branch in the org with a low file count | Yes | 1 |
| nodeVersion | Version | Generates a subdirectory "node" with four csv files | Yes | 5 |
| npmDependency | Dependency | Generates csv files in the "NPMDependencies" subdirectory corresponding to all the deps in the org | Yes | 3 |
| publicAndInternal | Simple | Generates a csv file called "PublicAndInternalReport.csv" listing repos that are public or internal | Yes | 2 |
| reposWithoutNewCommits | Simple | Generates a csv file called "ReposWithoutNewCommitsReport.csv" listing repos without a new commit in the last two years | Yes | 1 |
| secretScanningAlerts | Simple | Generates several CSV files found in the "SecretAlerts" and "SecretAlertsCount" subdirectories. The SecretAlertsCount directory contains a file listing the number of secrets per repo, and the SecretAlerts directory contains the specific information for every secret | Yes | 5 |
| staleBranch | Simple | Generates a csv file called "StaleBranchReport.csv" listing the number of stale branches on every repo in the org | Yes | 3 |
| teamlessRepo | Simple | Generates a csv file called "TeamlessRepoReport.csv" listing the repos in the org that do not have an admin team in GitHub | Yes | 4 |
| terraformModule | Dependency | Generates csv files in the "terraformModules" subdirectory corresponding to all the tf modules in the org | Yes | 3 |
| terraformVersion | Version | Generates a subdirectory "terraform" with four csv files | Yes | 5 |

Running Locally

Env Vars: You can copy the below environment variables into your run configuration

ENVIRONMENT_NAME=dev
GITHUB_ORG=<your-org>
GITHUB_TOKEN=<your-token>
STALE_DAYS_THRESHOLD=30

GITHUB_ORG and GITHUB_TOKEN are required variables. STALE_DAYS_THRESHOLD and ENVIRONMENT_NAME are not required, and default to "30" and "dev" respectively.
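The required and optional variables could be validated at startup along these lines. This is a sketch under assumptions: `ScanConfig` and `loadConfig` are hypothetical names, and the function takes the environment as a plain object (in Node, `process.env` could be passed in).

```typescript
// Hypothetical parsed configuration for one scan run.
interface ScanConfig {
  githubOrg: string;
  githubToken: string;
  staleDaysThreshold: number;
  environmentName: string;
}

// Read settings from an environment-like object, applying the documented
// defaults ("30" and "dev") and failing fast when a required value is missing.
function loadConfig(env: Record<string, string | undefined>): ScanConfig {
  const githubOrg = env["GITHUB_ORG"];
  const githubToken = env["GITHUB_TOKEN"];
  if (!githubOrg || !githubToken) {
    throw new Error("GITHUB_ORG and GITHUB_TOKEN are required");
  }

  const staleDaysThreshold = Number(env["STALE_DAYS_THRESHOLD"] ?? "30");
  if (!Number.isFinite(staleDaysThreshold) || staleDaysThreshold < 0) {
    throw new Error("STALE_DAYS_THRESHOLD must be a non-negative number");
  }

  return {
    githubOrg,
    githubToken,
    staleDaysThreshold,
    environmentName: env["ENVIRONMENT_NAME"] ?? "dev",
  };
}
```

Failing fast on a missing org or token surfaces a misconfigured run before any API calls are made.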

This tool works best if your GITHUB_TOKEN is associated with admin privileges over your organization. Otherwise, certain rules (for example, getting Code Scanning results and admin teams) may not function properly.

After adding these environment variables to your run configuration, the tool can be started by running the below command:

node --env-file=.env -r ts-node/register src/index.ts

or

npm run dev

Todos

  1. Create terraform/run in the cloud
  2. Make reports more object-oriented
  3. Non npm repos getting 0 on npm dep grade? Possible bug
  4. Improve caching
  5. Create more run options (run without cache, etc.)
  6. Dynamic LTS versioning for things besides node
  7. Deploy automatically to npm with GH Actions