This directory contains configuration for setting up IREE's GitHub Actions self-hosted runners.
The `gcp/` directory contains scripts specific to setting up runners on Google Cloud Platform (GCP). These runners are Managed Instance Groups (MIGs) that execute the GitHub Actions runner program as a service initialized on startup. The scripts automate the creation of VM [Images](https://cloud.google.com/compute/docs/images) and Instance Templates and the creation and update of the instance groups. These scripts mostly just automate some manual tasks and minimize errors. Our GCP project is `iree-oss`.
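For reference, the kind of `gcloud` calls these scripts wrap looks roughly like the following sketch (the resource names here are illustrative, not the ones the scripts actually use):

```shell
# Create a reusable VM image from a prepared boot disk (names are illustrative).
gcloud compute images create github-runner-image-example \
    --project=iree-oss \
    --source-disk=runner-image-build-disk \
    --source-disk-zone=us-west1-a

# Create an instance template that boots new runner VMs from that image.
gcloud compute instance-templates create github-runner-template-example \
    --project=iree-oss \
    --image=github-runner-image-example \
    --machine-type=n1-standard-16
```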
Included in the `gcp/` directory is the startup script that is configured to run when the VM instance starts up. It pulls in the rest of the configuration from the `config/` directory at a specified repository commit.
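As a rough sketch of that fetch-at-a-commit step (variable names and paths here are hypothetical, not the script's actual interface):

```shell
# Hypothetical sketch: download the repository archive at a pinned commit so
# every instance runs a known runner configuration.
CONFIG_REPO="iree-org/iree"
CONFIG_REF="${RUNNER_CONFIG_REF?}"  # e.g. a full commit SHA passed via VM metadata
curl -sSL "https://github.com/${CONFIG_REPO}/archive/${CONFIG_REF}.tar.gz" | tar -xz
# The runner configuration is then available under
# "iree-${CONFIG_REF}/build_tools/github_actions/runner/config/".
```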
The `config/` directory contains configuration that is pulled onto the runner on VM startup. This configuration registers the runner with the GitHub Actions control plane and then creates services to start the runner and to deregister the runner on shutdown. When the runner service exits, it initiates the shutdown of the VM, which triggers the deregister service.
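In rough shell terms, the lifecycle of the start-runner service looks something like the sketch below (service layout and paths are hypothetical):

```shell
# Hypothetical sketch of the runner service body: run the ephemeral runner,
# which exits after a single job, then power the VM off. The shutdown is what
# triggers the separate deregister service.
/home/runner/actions-runner/run.sh
systemctl poweroff
```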
Also in the `config/` directory is the configuration of the runner itself. The entry point is the `runner.env` file, which is symlinked into the runner's `.env` file and directs the runner to run hooks before and after each job. We use these hooks to ensure a consistent environment for jobs executed on the runner and to check that the job was triggered by an event that the runner is allowed to process (for instance, postsubmit runners will refuse to run a job triggered by a `pull_request` event).
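The hook wiring uses the hook environment variables understood by the GitHub Actions runner itself; a minimal sketch of what `runner.env` contains (the script paths are illustrative):

```shell
# Run a script before each job starts and another after each job completes.
# The variable names are standard runner settings; the paths are illustrative.
ACTIONS_RUNNER_HOOK_JOB_STARTED=/runner-config/hooks/job_started.sh
ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/runner-config/hooks/job_completed.sh
```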
Our runners are ephemeral, which means that after executing a single job the runner program exits. As noted above, the runner service triggers a shutdown of the VM instance when the runner exits. This shutdown triggers the deregister service, which attempts to deregister the runner from the GitHub Actions control plane. Note that if the runner stops gracefully (i.e. after completing a job), it is supposed to deregister itself automatically; the deregister service is there to catch other cases. It is best effort (the instance can execute a non-graceful shutdown), but the only downside to failed deregistration appears to be “offline” runner entries hanging around in the UI. GitHub will garbage collect these after a certain time period (30 days for normal runners and 1 day for ephemeral runners), so deregistration is not critical.
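Concretely, ephemeral registration and best-effort removal go through the runner's own `config.sh`, roughly as sketched below (token acquisition is covered in the next paragraph; the labels shown are illustrative):

```shell
# Register as an ephemeral runner: it will process exactly one job, then exit.
./config.sh --unattended --ephemeral \
    --url https://github.com/iree-org \
    --token "${REGISTRATION_TOKEN?}" \
    --labels self-hosted,runner-group=presubmit,environment=prod,cpu

# Best-effort deregistration on shutdown (a graceful single-job exit normally
# deregisters the runner automatically).
./config.sh remove --token "${REMOVAL_TOKEN?}"
```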
Registering a GitHub Actions Runner requires a registration token. To obtain such a token, you must have very broad access to either the organization or repository you are registering it in. This access is too broad to grant to the runners themselves. Therefore, we mediate the token acquisition through a proxy hosted on Google Cloud Run. The proxy has the app token for a GitHub App with permission to manage self-hosted runners for the “iree-org” GitHub organization. It receives requests from the runners when they are trying to register or deregister and returns them the much more narrowly scoped [de]registration token. We use https://github.com/google-github-actions/github-runner-token-proxy for the proxy. You can see its docs for more details.
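The tokens the proxy hands out correspond to the standard GitHub API endpoints for self-hosted runner management, which an organization admin could call directly:

```shell
# These calls require admin-level credentials, which is exactly what we do not
# want on the runners themselves; the runners go through the proxy instead.
gh api --method POST /orgs/iree-org/actions/runners/registration-token
gh api --method POST /orgs/iree-org/actions/runners/remove-token
```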
The presubmit and postsubmit runners run as different service accounts depending on their trust level. Presubmit runners are “minimal” trust and postsubmit runners are “basic” trust, so they run as `github-runner-minimal-trust@iree-oss.iam.gserviceaccount.com` and `github-runner-basic-trust@iree-oss.iam.gserviceaccount.com`, respectively.
Using GitHub's artifact actions with runners on GCE turns out to be prohibitively slow (see discussion in https://github.com/iree-org/iree/issues/9881). Instead we use our own Google Cloud Storage (GCS) buckets to save artifacts from jobs and fetch them in subsequent jobs: `iree-github-actions-presubmit-artifacts` and `iree-github-actions-postsubmit-artifacts`. Each runner group's service account has access only to the bucket for its group. Artifacts are indexed by workflow run id and attempt number so that they do not collide. Subsequent jobs should not make assumptions about where an artifact was stored, however; instead, they should query the outputs of the job that created it (which should always provide such an output). This both promotes DRY principles and avoids subtle issues: for example, a rerun of a failed job may be on run attempt 2 while it fetches artifacts from a job dependency that succeeded on attempt 1 and therefore did not rerun and recreate the artifacts indexed under the new attempt.
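A sketch of the upload side under the indexing scheme described above (the exact object layout and file names are illustrative):

```shell
# Upload an artifact indexed by workflow run id and attempt number so that
# reruns do not collide. The bucket depends on the runner group.
GCS_DIR="gs://iree-github-actions-presubmit-artifacts/${GITHUB_RUN_ID?}/${GITHUB_RUN_ATTEMPT?}"
gcloud storage cp build-artifacts.tar "${GCS_DIR}/build-artifacts.tar"

# A later job should fetch from the location published in the producing job's
# outputs rather than reconstructing this path itself.
```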
The GitHub Actions runners are identified with labels that indicate properties of the runner. Some of the labels are automatically generated from information about the runner on startup, such as its GCP zone and hostname; others match GitHub's standard labels, like the OS; and some are injected as custom labels via metadata, like whether the VM is optimized for CPU or GPU usage. All self-hosted runners receive the `self-hosted` label.
Note that when specifying where a job runs, any runner that has all of the specified labels can pick up the job. So if you leave off the `runner-group` label, for instance, the job will non-deterministically run on either presubmit or postsubmit runners. We do not currently have a solution for this problem other than careful code authorship and review.
The runners for iree-org can be viewed in the GitHub UI. Unfortunately, only organization admins have access to this page. Organization admin gives very broad privileges, so this set is necessarily kept small.
We frequently need to update the runner instances. In particular, after a runner release, the version of the program running on the runners must be updated within 30 days; otherwise the GitHub control plane will refuse their connection. Testing and rolling out these updates involves a few steps. Performing the runner update is assisted by the script `update_instance_groups.py`.
The GCP API only allows querying MIGs by region, so the script has to perform a separate call for every region of interest. It is therefore useful to limit the regions to only those in which we operate. Right now, that is only the US, so you can pass a regex like `us-\w+` to the regions argument in the commands below. If we start running in non-US regions, make sure to update these commands!
For updating the runner version in particular, you can use `update_runner_version.py` and skip deployment to test runners, going straight to a prod canary.
See https://cloud.google.com/compute/docs/instance-groups/updating-migs for the main documentation. There are two modes for a rolling MIG update, “proactive” and “opportunistic” (AKA “selective”). There are also three different actions the MIG can take to update an instance: “refresh”, “restart”, and “replace”. A “refresh” update only allows updating instance metadata or adding extra disks, but is mostly safe to run as a “proactive” update. In our case, instances will pick up changes to the startup script when they restart naturally. If you need to change something like the boot disk image, you need to do a replacement of the VM, but in this case a “proactive” update is not safe because it would shut down the VM even if it was in the middle of running a job. In an “opportunistic” update, the MIG is supposed to apply the update when instances are created, but it doesn't apply updates if it's recreating an instance deemed “unhealthy”, which includes instances that shut themselves down or fail their health check. There is also a restriction that you can have only one “in-progress” update at a time. This can lead to some weird states where instances are bootlooping and you can't update them. In this case, you can manually delete the misbehaving instances and try to get everything back onto a good version.
In general, the recommended approach (which the scripting defaults to) is to do updates as opportunistic VM replacement. With refresh, a running VM can end up with a mismatch between the template it says it's using and the commit it's actually configured from, which makes it difficult to track rollout state. The apparent speed of refresh updates is also somewhat illusory: for the update to fully take effect for anything that happens as part of the startup script (which is basically everything, in our case), the VM has to restart anyway.
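Under the hood these are MIG rolling updates; the raw `gcloud` equivalent of an opportunistic replacement looks roughly like this (the scripts handle this for you; group and template names are illustrative):

```shell
# Opportunistic rolling update: instances only move to the new template when
# they are recreated (e.g. after shutting themselves down post-job).
gcloud compute instance-groups managed rolling-action start-update \
    github-runner-prod-presubmit-cpu-us-west1 \
    --project=iree-oss \
    --region=us-west1 \
    --version=template=github-runner-template-example \
    --type=opportunistic \
    --most-disruptive-allowed-action=replace
```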
Opportunistic updates can be slow because VMs generally only get deleted when they complete a job. To speed them along, you can use `remove_idle_runners.sh` to relatively safely bring down instances that aren't currently processing a job.
We have groups of testing runners (tagged with the `environment=testing` label) to which new runner configurations can be deployed and then tested by targeting jobs at them using that label. Create templates using the `create_templates.sh` script, overriding the `TEMPLATE_CONFIG_REPO` and/or `TEMPLATE_CONFIG_REF` environment variables to point to your new configuration (see the example below). The autoscaling configuration for the testing group usually has both min and max replicas set to 0, so there aren't any instances running. Update the configuration to something appropriate for your testing (probably something like 1-10) using `update_autoscaling.sh`:
```shell
build_tools/github_actions/runner/gcp/update_autoscaling.sh \
  github-runner-testing-presubmit-cpu-us-west1 us-west1 1 10
```
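For the template-creation step mentioned above, the override is done via environment variables, along these lines (the repository and ref values are placeholders):

```shell
TEMPLATE_CONFIG_REPO=<your-fork>/iree \
TEMPLATE_CONFIG_REF=<your-commit-sha> \
  build_tools/github_actions/runner/gcp/create_templates.sh
```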
Update the testing instance group to your new template (no need to canary to the test group):
```shell
build_tools/github_actions/runner/gcp/update_instance_groups.py direct-update \
  --env=testing --region='us-\w+' --group=all --type=all --version="${VERSION?}"
```
Check that your runners successfully start up and show as registered in the GitHub UI. Then send a PR or trigger a workflow dispatch (depending on what you're testing) targeting the testing environment and ensure that your new runners work. Send and merge a PR updating the runner configuration. When you're done, make sure to set the testing group autoscaling back to 0-0:
```shell
build_tools/github_actions/runner/gcp/update_autoscaling.sh \
  github-runner-testing-presubmit-cpu-us-west1 us-west1 0 0
```
You'll also need to delete the remaining runners because, without jobs to process, they will never delete themselves:
```shell
build_tools/github_actions/runner/gcp/remove_idle_runners.sh \
  testing-presubmit cpu us-west1
```
Since the startup script used by the runners references a specific commit, merging the PR will not immediately affect them. Note that this means any changes you make need to be forward and backward compatible with anything that is picked up directly from the tip of the tree (such as workflow files). Such changes should be made in separate PRs.
To deploy to prod, create new prod templates. Then canary the new template for one instance in each group.
Note: The a100 groups are special. We only run one instance in each group and have one of every type in every region, so canarying within a single instance group doesn't really make any sense. Also, we use the `balanced` target distribution shape, which theoretically means that the group manager will avoid zones with no available capacity (which happens a lot). This distribution shape is for some reason incompatible with having multiple templates. So in the canarying below, we treat these differently.
```shell
build_tools/github_actions/runner/gcp/update_instance_groups.py canary \
  --env=prod --region='us-\w+' --group=all --type='[^a]\w+' \
  --version="${VERSION}"
build_tools/github_actions/runner/gcp/update_instance_groups.py direct-update \
  --env=prod --region='us-central1' --group=all --type=a100 \
  --version="${VERSION}"
```
Watch to make sure that your new runners are starting up and registering as expected and that there aren't any additional failures. It is probably best to wait on the order of days before proceeding. When you are satisfied that your new configuration is good, complete the update with your new template:
```shell
build_tools/github_actions/runner/gcp/update_instance_groups.py promote-canary \
  --env=prod --region='us-\w+' --group=all --type='[^a]\w+'
build_tools/github_actions/runner/gcp/update_instance_groups.py direct-update \
  --env=prod --region='us-\w+' --group=all --type=a100 \
  --version="${VERSION}"
```
You can monitor the state of rollouts via the GitHub API. This requires elevated permissions in the organization. A command like this helps see how many runners are still running the old version:
```shell
gh api --paginate '/orgs/iree-org/actions/runners?per_page=100' \
  | jq --raw-output \
    ".runners[] | select(.labels | map(.name == \"runner-version=${OLD_RUNNER_VERSION?}\") | any) | .name"
```
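You can also watch the rollout from the GCP side; listing the instances in a group shows which template each instance is running (group name illustrative):

```shell
gcloud compute instance-groups managed list-instances \
    github-runner-prod-presubmit-cpu-us-west1 \
    --project=iree-oss \
    --region=us-west1
```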
There are a number of known issues and areas for improvement for the runners: