| # GitHub Self-Hosted Runner Configuration |
| |
| This directory contains configuration for setting up IREE's GitHub Actions |
| [self-hosted runners](https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners). |
| |
| The [`gcp/`](./gcp) directory contains scripts specific to setting up runners on |
| Google Cloud Platform (GCP). These are |
| [Managed Instance Groups](https://cloud.google.com/compute/docs/instance-groups) |
| that execute the [GitHub actions runner](https://github.com/actions/runner) as a |
| service initialized on startup. The scripts automate the creation of VM |
| [Instance Templates](https://cloud.google.com/compute/docs/instance-templates) and the
| creation and update of the instance groups. These scripts are still at an early
| stage: rather than taking flags, they must be edited to run with different
| configurations. They mostly just automate some manual tasks and reduce the
| opportunity for error. Our GCP project is
| [iree-oss](https://console.cloud.google.com/?project=iree-oss). |
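|
| For orientation, creating one of these templates by hand boils down to a
| `gcloud` call along roughly the following lines (all of the values here are
| illustrative; [`create_templates.sh`](./gcp/create_templates.sh) fills in the
| real machine types, images, and metadata):
|
| ```shell
| # Illustrative sketch only; see create_templates.sh for the actual settings.
| gcloud compute instance-templates create github-runner-example-template \
|   --project=iree-oss \
|   --machine-type=n1-standard-16 \
|   --image-family=debian-11 \
|   --image-project=debian-cloud \
|   --metadata-from-file=startup-script=startup_script.sh
| ```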
| |
| Included in the `gcp` directory is the [startup script](./gcp/startup_script.sh) |
| that is configured to run when the VM instance starts up. It pulls in the rest |
| of the configuration from the [`config`](./config) directory at a specified |
| repository commit. |
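|
| As a rough sketch of that pattern (the metadata attribute name and paths below
| are assumptions for illustration; see
| [`startup_script.sh`](./gcp/startup_script.sh) for the real logic):
|
| ```shell
| # Hypothetical sketch: read the pinned commit from instance metadata, then
| # fetch the runner config at exactly that commit.
| CONFIG_REF="$(curl -sSfL -H "Metadata-Flavor: Google" \
|   "http://metadata.google.internal/computeMetadata/v1/instance/attributes/github-runner-config-ref")"
| git clone https://github.com/iree-org/iree.git /tmp/iree
| git -C /tmp/iree checkout "${CONFIG_REF}"
| cp -r /tmp/iree/build_tools/github_actions/runner/config /runner-root/config
| ```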
| |
| The [`config/`](./config) directory contains configuration that is pulled into |
| the runner on VM startup. This configuration registers the runner with the |
| GitHub Actions control plane and then creates services to start the runner and |
| to deregister the runner on shutdown. When the runner service exits, it |
| initiates the shutdown of the VM, which triggers the deregister service. |
| |
| Also in the config directory is configuration of the runner itself. The entry |
| point is the [`runner.env`](./config/runner.env) file, which is symlinked into |
| the runner's `.env` file and directs the runner to run |
| [hooks before and after each job](https://docs.github.com/en/actions/hosting-your-own-runners/running-scripts-before-or-after-a-job). |
| We use these hooks to ensure a consistent environment for jobs executed on the |
| runner and to check that the job was triggered by an event that the runner is |
| allowed to process (for instance, postsubmit runners will refuse to run a job |
| triggered by a `pull_request` event). |
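|
| For illustration, the hook wiring in the `.env` file amounts to entries like
| the following (the variable names are GitHub's documented hook settings; the
| paths here are placeholders rather than the actual ones in
| [`runner.env`](./config/runner.env)):
|
| ```shell
| # Placeholder paths; these variables are the documented pre/post-job hook knobs.
| ACTIONS_RUNNER_HOOK_JOB_STARTED=/runner-root/config/hooks/job_started.sh
| ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/runner-root/config/hooks/job_completed.sh
| ```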
| |
| ## Ephemeral Runners and Autoscaling |
| |
| Our runners are ephemeral, which means that after executing a single job the |
| runner program exits. As noted above, the runner service triggers a shutdown of |
| the VM instance when the runner exits. This shutdown triggers the deregister |
| service which attempts to deregister the runner from the GitHub Actions control |
| plane. Note that if the runner stopped gracefully (i.e. after completing a
| job), it is *supposed* to deregister itself automatically; the deregister
| service catches other cases. Deregistration is best effort (as the instance can
| execute a non-graceful
| shutdown), but the only downside to failed deregistration appears to be |
| "offline" runner entries hanging around in the UI. GitHub will garbage collect |
| these after a certain time period (30 days for normal runners and 1 day for |
| ephemeral runners), so deregistration is not critical. |
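|
| Putting the pieces together, the lifecycle is roughly the following (a
| simplified sketch: the token acquisition, labels, and paths are illustrative,
| and the real flow is split across the startup script and the services in
| [`config/`](./config)):
|
| ```shell
| # Register as an ephemeral runner: it will accept exactly one job, then exit.
| ./config.sh --unattended --ephemeral \
|   --url https://github.com/iree-org \
|   --token "${REGISTRATION_TOKEN}" \
|   --labels "self-hosted,cpu,presubmit"
| ./run.sh              # returns once the single job has finished
| sudo shutdown -h now  # shutting down triggers the deregister service
| ```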
| |
| ### Runner Token Proxy |
| |
| Registering a GitHub Actions Runner requires a registration token. To obtain |
| such a token, you must have very broad access to either the organization or |
| repository you are registering it in. This access is too broad to grant to the |
| runners themselves. Therefore, we mediate the token acquisition through a proxy |
| hosted on [Google Cloud Run](https://cloud.google.com/run). The proxy has the |
| app token for a GitHub App with permission to manage self-hosted runners for the |
| "iree-org" GitHub organization. It receives requests from the runners when they |
| are trying to register or deregister and returns them the much more narrowly |
| scoped [de]registration token. We use |
| https://github.com/google-github-actions/github-runner-token-proxy for the |
| proxy. You can see its docs for more details. |
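|
| For reference, the GitHub API call that the proxy makes on the runners' behalf
| is the organization registration-token endpoint, which requires credentials far
| too powerful to place on the VMs themselves (the token variable below is
| illustrative):
|
| ```shell
| # Only the proxy holds credentials with this scope; runners just receive the
| # short-lived registration token from the response.
| curl -X POST \
|   -H "Authorization: Bearer ${GITHUB_APP_INSTALLATION_TOKEN}" \
|   -H "Accept: application/vnd.github+json" \
|   https://api.github.com/orgs/iree-org/actions/runners/registration-token
| ```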
| |
| ## Service Accounts |
| |
| The presubmit and postsubmit runners run as different service accounts depending |
| on their trust level. Presubmit runners are "minimal" trust and postsubmit |
| runners are "basic" trust, so they run as |
| `github-runner-minimal-trust@iree-oss.iam.gserviceaccount.com` and |
| `github-runner-basic-trust@iree-oss.iam.gserviceaccount.com`, respectively. |
| |
| ## Passing Artifacts |
| |
| Using GitHub's [artifact actions](https://github.com/actions/upload-artifact) |
| with runners on GCE turns out to be prohibitively slow (see discussion in |
| https://github.com/iree-org/iree/issues/9881). Instead we use our own |
| [Google Cloud Storage](https://cloud.google.com/storage) (GCS) buckets to save |
| artifacts from jobs and fetch them in subsequent jobs: |
| `iree-github-actions-presubmit-artifacts` and |
| `iree-github-actions-postsubmit-artifacts`. Each runner group's service account |
| has access only to the bucket for its group. Artifacts are indexed by the
| workflow run id and attempt number so that they do not collide. Subsequent jobs
| should *not* make assumptions about where an artifact was stored, however;
| instead they should query the outputs of the job that created it (which should
| always provide such an output). This both promotes DRY principles and handles
| subtle cases: for example, a rerun of a failed job may be on run attempt 2
| while a job dependency that succeeded on attempt 1 did not rerun, so its
| artifacts are still indexed by the earlier attempt.
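|
| As an illustration (the exact object layout and output name here are
| assumptions, not a contract), an upload step might look like:
|
| ```shell
| # Upload to a run- and attempt-scoped path, then expose that path as a job
| # output so downstream jobs don't have to reconstruct it.
| GCS_DIR="gs://iree-github-actions-presubmit-artifacts/${GITHUB_RUN_ID}/${GITHUB_RUN_ATTEMPT}"
| gsutil cp build-artifacts.tar "${GCS_DIR}/build-artifacts.tar"
| echo "build-artifacts-gcs-path=${GCS_DIR}/build-artifacts.tar" >> "${GITHUB_OUTPUT}"
| ```
|
| A downstream job then reads that output (e.g.
| `needs.<job>.outputs.build-artifacts-gcs-path`) rather than rebuilding the path
| itself.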
| |
| ## Labels |
| |
| The GitHub Actions Runners are identified with |
| [labels](https://docs.github.com/en/enterprise-cloud@latest/actions/hosting-your-own-runners/using-labels-with-self-hosted-runners) |
| that indicate properties of the runner. Some of the labels are automatically
| generated on startup from information about the runner, such as its GCP zone
| and hostname; others match GitHub's standard labels, like the OS; and some are
| injected as custom labels via metadata, like whether the VM is optimized for
| CPU or GPU usage. All self-hosted runners receive the `self-hosted` label.
| |
| Note that when specifying where a job runs, any runner that has all of the
| specified labels can pick up the job. So if you leave off the runner-group, for
| instance, the job will non-deterministically run on either presubmit or
| postsubmit runners. We do not currently have a solution for this problem other
| than careful code authorship and review.
| |
| ## Examining Runners |
| |
| The runners for iree-org can be viewed in the |
| [GitHub UI](https://github.com/organizations/iree-org/settings/actions/runners). |
| Unfortunately, only organization admins have access to this page. Organization |
| admin gives very broad privileges, so this set is necessarily kept very small by |
| Google security policy. |
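|
| For those admins, the same information is also available from the API, e.g. via
| the `gh` CLI (this requires a token with organization admin access):
|
| ```shell
| # Lists each registered runner with its online/offline status and whether it
| # is currently running a job.
| gh api /orgs/iree-org/actions/runners --paginate \
|   --jq '.runners[] | {name, status, busy}'
| ```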
| |
| ## Updating the Runners |
| |
| We frequently need to update the runner instances. In particular, after a Runner |
| release, the version of the program running on the runners must be updated |
| [within 30 days](https://docs.github.com/en/enterprise-cloud@latest/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners#controlling-runner-software-updates-on-self-hosted-runners), |
| otherwise the GitHub control plane will refuse their connection. Testing and |
| rolling out these updates involves a few steps. |
| |
| ### MIG Rolling Updates |
| |
| See https://cloud.google.com/compute/docs/instance-groups/updating-migs for the |
| main documentation. There are two modes for a rolling MIG update, "proactive" and |
| "opportunistic" (AKA "selective"). There are also three different actions the |
| MIG can take to update an instance: "refresh", "restart", and "replace". A |
| "refresh" update only allows updating instance metadata or adding extra disks, |
| but is mostly safe to run as a "proactive" update. Instances will pick up the |
| changes to the startup script when they restart naturally. If there are changes |
| to metadata that is accessed outside of startup, make sure it's compatible with |
| the old configuration. If it's not or you need to change something like the boot |
| disk image, you need to do a replacement of the VM, which brings it down along |
| with any jobs it's in the middle of running. That means it is not safe to do a |
| "proactive" update. In an "opportunistic" update, the MIG is *supposed* to apply |
| the update when the instances are created, which would work great for us since |
| our instances shut themselves down when they're done with a job, but apparently |
| it *doesn't* apply updates if it's recreating an instance deemed "unhealthy" |
| which is unfortunately the case for instances that shut themselves down. So |
| these sorts of updates need to be done as "opportunistic" updates which will |
| need to be manually managed. You can use |
| [`remove_idle_runners.sh`](./gcp/remove_idle_runners.sh) to relatively safely |
| bring down instances that aren't currently processing a job. |
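|
| For reference, a rolling update of this kind can also be expressed directly
| with `gcloud` (group and template names here are illustrative, and this is not
| necessarily how [`update_instance_groups.py`](./gcp/update_instance_groups.py)
| invokes it):
|
| ```shell
| # Sketch of a metadata-only "refresh" rolled out "proactively" (the case
| # described above as mostly safe).
| gcloud compute instance-groups managed rolling-action start-update \
|   github-runner-presubmit-cpu-us-west1 \
|   --region=us-west1 \
|   --version=template=github-runner-example-template \
|   --type=proactive \
|   --minimal-action=refresh \
|   --most-disruptive-allowed-action=refresh
| ```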
| |
| ### Test Runners |
| |
| We have groups of testing runners (tagged with the `environment=testing` label)
| that can be used to deploy new runner configurations, which can then be tested
| by targeting jobs at that label. Create templates using the
| [`create_templates.sh`](./gcp/create_templates.sh) script, overriding the |
| `TEMPLATE_CONFIG_REPO` and/or `TEMPLATE_CONFIG_REF` environment variables to |
| point to your new configurations. The autoscaling configuration for the testing |
| group usually has both min and max replicas set to 0, so there aren't any |
| instances running. Update the configuration to something appropriate for your |
| testing (probably something like 1-10) using |
| [`update_autoscaling.sh`](./gcp/update_autoscaling.sh): |
| |
| ```shell |
| build_tools/github_actions/runner/gcp/update_autoscaling.sh \ |
| github-runner-testing-presubmit-cpu-us-west1 us-west1 1 10 |
| ``` |
| |
| Update the testing instance group to your new template using |
| [`update_instance_groups.py`](./gcp/update_instance_groups.py) (no need to |
| canary to the test group): |
| |
| ```shell |
| build_tools/github_actions/runner/gcp/update_instance_groups.py direct-update \ |
| --env=testing --region=all --group=all --type=all --version="${VERSION?}" |
| ``` |
| |
| Check in the GitHub UI that your runners successfully start up and register.
| Then send a PR or trigger a workflow dispatch (depending on what you're testing) |
| targeting the testing environment, and ensure that your new runners work. Send |
| and merge a PR updating the runner configuration. When you're done, make sure to |
| set the testing group autoscaling back to 0-0. |
| |
| ```shell |
| build_tools/github_actions/runner/gcp/update_autoscaling.sh \ |
| github-runner-testing-presubmit-cpu-us-west1 us-west1 0 0 |
| ``` |
| |
| For now, you'll also need to shut down the remaining runners yourself. You can
| delete the instances from the managed instance group UI, or use
| [`remove_idle_runners.sh`](./gcp/remove_idle_runners.sh) as shown below. The
| necessity for manual deletion will go away with future improvements.
| |
| ```shell |
| build_tools/github_actions/runner/gcp/remove_idle_runners.sh \ |
| testing-presubmit cpu us-west1 |
| ``` |
| |
| ### Deploy to Prod |
| |
| Since the startup script used by the runners references a specific commit, |
| merging the PR will not immediately affect them. Note that this means that any |
| changes you make need to be forward and backward compatible with changes to |
| anything that is picked up directly from tip of tree (such as workflow files). |
| Such changes should be made in separate PRs.
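|
| Creating the new prod templates (the first step below) follows the same
| [`create_templates.sh`](./gcp/create_templates.sh) flow as for the testing
| group, now pointing at the merged commit. A sketch, assuming the same
| environment overrides apply (the script defaults may already be correct):
|
| ```shell
| # Remote and ref names here are illustrative.
| TEMPLATE_CONFIG_REF="$(git rev-parse upstream/main)" \
|   build_tools/github_actions/runner/gcp/create_templates.sh
| ```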
| |
| To deploy to prod, create new prod templates. Then use the update script to |
| canary to a single instance: |
| |
| ```shell |
| build_tools/github_actions/runner/gcp/update_instance_groups.py canary-update \
| --prod --region=all --group=all --type=all --version="${VERSION}" |
| ``` |
| |
| Watch to make sure that your new runners are starting up and registering as |
| expected and there aren't any additional failures. It is probably best to wait |
| on the order of days before proceeding. When you are satisfied that your new |
| configuration is good, complete the update with your new template: |
| |
| ```shell |
| build_tools/github_actions/runner/gcp/update_instance_groups.py promote-canary \
| --prod --region=all --group=all --type=all |
| ``` |
| |
| To speed things along, you may want to remove idle instances since they'll only |
| pick up the updates on restart. |
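|
| Removing idle prod instances uses the same script as for the testing group (the
| arguments below are an assumed example; adjust them for the groups you
| updated):
|
| ```shell
| build_tools/github_actions/runner/gcp/remove_idle_runners.sh \
|   presubmit cpu us-west1
| ```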
| |
| ## Known Issues / Future Work |
| |
| There are a number of known issues and areas for improvement for the runners:
| |
| - Autoscaling currently uses only CPU usage (the default), which does not
| work at all for GPU-based runners. The GPU groups are set with minimum and |
| maximum autoscaling size set to the same value (this is slightly different |
| from being set to a fixed value for detailed reasons that I won't go into). We |
| need to set up autoscaling based on |
| [GitHub's job queueing webhooks](https://docs.github.com/en/enterprise-cloud@latest/actions/hosting-your-own-runners/autoscaling-with-self-hosted-runners#using-webhooks-for-autoscaling). |
| - The runners currently use a persistent disk, which results in relatively slow
|   IO. Due to errors made in setting these up, the disk image and disk for the
|   runners are 1TB, which is also wasteful and results in quota issues.
| - If the runner fails to register (e.g. GitHub has a server error, which has
|   happened on multiple occasions), the VM will sit idle, running according to
|   the autoscaler. The runner doesn't have any way to check its health status,
|   which would allow the autoscaler to recognize there was a problem and replace
|   the instance (https://github.com/actions/runner/issues/745).
| - MIG autoscaling has the option to scale groups up and down. We currently have
|   it set to only scale up. When scaling down, the autoscaler just sends a
|   shutdown signal to the instance, which then has
|   [90 seconds to run a shutdown script](https://cloud.google.com/compute/docs/shutdownscript),
|   not nearly enough to complete a long-running build; there is no functionality
|   to send a gentler shutdown signal. This is especially problematic because our
|   current CPU-usage-based autoscaling is an imperfect measure that in particular
|   considers an instance idle if it is only doing IO (e.g. uploading artifacts).
|   Job-queue-based autoscaling would probably help, but the same problem would
|   exist. We likely need to implement functionality in the instance to shut
|   itself down after some period of inactivity.
| - MIG "opportunistic" or "selective" autoscaling will not update an instance |
| when it replaces it for being unhealthy. Unfortunately for us, this includes |
| cases where the instance shut itself down. This means that any updates |
| requiring a restart need to be managed manually. |