Using a remote cache to speed up mypy runs

Mypy performs type checking incrementally, reusing results from previous runs to speed up successive runs. If you are type checking a large codebase, mypy can still sometimes be slower than desirable. For example, if you create a new branch based on a much more recent commit than the target of the previous mypy run, mypy may have to process almost every file, as a large fraction of source files may have changed. This can also happen after you’ve rebased a local branch.

Mypy supports using a remote cache to improve performance in cases such as the above. In a large codebase, remote caching can sometimes speed up mypy runs by a factor of 10 or more.

Mypy doesn’t include all components needed to set this up. Generally you will have to perform some simple integration with your Continuous Integration (CI) or build system to configure mypy to use a remote cache. This discussion assumes you have a CI system set up for the mypy build you want to speed up, and that you are using a central git repository. Generalizing to different environments should not be difficult.

Here are the main components needed:

  • A shared repository for storing mypy cache files for all landed commits.
  • A CI build that uploads mypy incremental cache files to the shared repository for each commit for which the CI build runs.
  • A wrapper script around mypy that developers use to run mypy with remote caching enabled.

Below we discuss each of these components in some detail.

Shared repository for cache files

You need a repository that allows you to upload mypy cache files from your CI build and make the cache files available for download based on a commit id. A simple approach would be to produce an archive of the .mypy_cache directory (which contains the mypy cache data) as a downloadable build artifact from your CI build (depending on the capabilities of your CI system). Alternatively, you could upload the data to a web server or to S3, for example.

Continuous Integration build

The CI build would run a regular mypy build and create an archive containing the .mypy_cache directory produced by the build. Finally, it would publish the archive as a build artifact or upload it to a repository where it is accessible by the mypy wrapper script.

Your CI script might work like this:

  • Run mypy normally. This will generate cache data under the .mypy_cache directory.
  • Create a tarball from the .mypy_cache directory.
  • Determine the current git master branch commit id (say, using git rev-parse HEAD).
  • Upload the tarball to the shared repository with a name derived from the commit id.
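
The steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the archive naming scheme (mypy-cache-<commit id>.tar.gz), the build output directory, and the upload() helper, which simply copies the archive into a shared directory, are all hypothetical. A real setup would replace upload() with whatever your CI system or storage service provides.

```python
import shutil
import subprocess
import tarfile
from pathlib import Path
from typing import Sequence


def current_commit_id(repo_dir: Path) -> str:
    """Return the commit id that HEAD points at (git rev-parse HEAD)."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def archive_name(commit_id: str) -> str:
    """Derive the archive file name from a commit id (hypothetical scheme)."""
    return f"mypy-cache-{commit_id}.tar.gz"


def create_cache_archive(repo_dir: Path, commit_id: str, out_dir: Path) -> Path:
    """Pack repo_dir/.mypy_cache into a gzip-compressed tarball in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / archive_name(commit_id)
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(repo_dir / ".mypy_cache", arcname=".mypy_cache")
    return archive


def upload(archive: Path, shared_dir: Path) -> Path:
    """Placeholder upload (assumption): copy the archive to a shared directory."""
    shared_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(archive, shared_dir / archive.name))


def ci_build(repo_dir: Path, shared_dir: Path, mypy_args: Sequence[str]) -> Path:
    """Run mypy, archive the cache, and upload it under the commit id."""
    # A nonzero exit status (for example, reported type errors) is not
    # treated as fatal in this sketch; adjust to your CI policy.
    subprocess.run(["mypy", *mypy_args], cwd=repo_dir, check=False)
    commit_id = current_commit_id(repo_dir)
    archive = create_cache_archive(repo_dir, commit_id, repo_dir / "build")
    return upload(archive, shared_dir)
```

A CI job could then call ci_build() with the checkout directory, the shared location, and the usual mypy arguments.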

Mypy wrapper script

The wrapper script is used by developers to run mypy locally during development instead of invoking mypy directly. The wrapper first populates the local .mypy_cache directory from the shared repository and then runs a normal incremental build.

The wrapper script needs some logic to determine the most recent central repository commit (by convention, the origin/master branch for git) the local development branch is based on. In a typical git setup you can do it like this:

  git merge-base HEAD origin/master

The next step is to download the cache data (contents of the .mypy_cache directory) from the shared repository based on the commit id of the merge base produced by the git command above. The script will decompress the data so that mypy will start with a fresh .mypy_cache. Finally, the script runs mypy normally. And that’s all!
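
A wrapper along these lines might look like the following Python sketch. It rests on several assumptions: the shared repository is reachable as a directory (for example a network mount), archives are named mypy-cache-<commit id>.tar.gz, and each archive contains a top-level .mypy_cache directory. None of these names come from mypy itself. Archives should only be extracted if they come from a trusted source.

```python
import shutil
import subprocess
import tarfile
from pathlib import Path
from typing import Optional, Sequence


def merge_base(repo_dir: Path, upstream: str = "origin/master") -> str:
    """Return the commit id of the merge base of HEAD and the upstream branch."""
    result = subprocess.run(
        ["git", "merge-base", "HEAD", upstream],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def find_archive(shared_dir: Path, commit_id: str) -> Optional[Path]:
    """Locate the archive for a commit (hypothetical naming), if present."""
    archive = shared_dir / f"mypy-cache-{commit_id}.tar.gz"
    return archive if archive.is_file() else None


def populate_cache(archive: Path, repo_dir: Path) -> None:
    """Replace repo_dir/.mypy_cache with the contents of the archive."""
    cache_dir = repo_dir / ".mypy_cache"
    if cache_dir.exists():
        shutil.rmtree(cache_dir)  # start from a fresh cache directory
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support safer extraction filters.
            tar.extractall(repo_dir, filter="data")
        else:
            tar.extractall(repo_dir)


def run_mypy(repo_dir: Path, shared_dir: Path, mypy_args: Sequence[str]) -> int:
    """Populate the local cache from the merge base's archive, then run mypy."""
    archive = find_archive(shared_dir, merge_base(repo_dir))
    if archive is not None:
        populate_cache(archive, repo_dir)
    # With or without downloaded data, this is a normal incremental run.
    return subprocess.run(["mypy", *mypy_args], cwd=repo_dir).returncode
```

If no archive exists for the merge base, the sketch simply runs mypy against whatever local cache is already present.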

Caching with mypy daemon

You can also use remote caching with the mypy daemon. The remote cache will significantly speed up the first dmypy check run after starting or restarting the daemon.

The mypy daemon requires extra fine-grained dependency data in the cache files, which isn’t included by default. To use caching with the mypy daemon, use the --cache-fine-grained option in your CI build:

  $ mypy --cache-fine-grained <args...>

This flag adds extra information for the daemon to the cache. In order to use this extra information, you will also need to use the --use-fine-grained-cache option with dmypy start or dmypy restart. Example:

  $ dmypy start -- --use-fine-grained-cache <options...>

Now your first dmypy check run should be much faster, as it can use cache information to avoid processing the whole program.
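
If the CI build and the wrapper are written in Python, the two sides of this setup could be expressed as below. This is only a sketch: the function names are made up for illustration, and the commands are the ones shown above (with dmypy restart used in place of dmypy start).

```python
import subprocess
from typing import List, Sequence


def ci_mypy_command(mypy_args: Sequence[str]) -> List[str]:
    """CI side: write fine-grained dependency data into the cache."""
    return ["mypy", "--cache-fine-grained", *mypy_args]


def daemon_restart_command(mypy_options: Sequence[str]) -> List[str]:
    """Developer side: (re)start the daemon so that it loads the cache."""
    return ["dmypy", "restart", "--", "--use-fine-grained-cache", *mypy_options]


def check_with_daemon(
    repo_dir: str, mypy_options: Sequence[str], files: Sequence[str]
) -> int:
    """Restart the daemon with the fine-grained cache, then check the files."""
    subprocess.run(daemon_restart_command(mypy_options), cwd=repo_dir, check=True)
    return subprocess.run(["dmypy", "check", *files], cwd=repo_dir).returncode
```

The wrapper would call check_with_daemon() after it has populated the local .mypy_cache directory from the shared repository.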

Refinements

There are several optional refinements that may improve things further, at least if your codebase is hundreds of thousands of lines or more:

  • If the wrapper script determines that the merge base hasn’t changed from a previous run, there’s no need to download the cache data and it’s better to instead reuse the existing local cache data.
  • If you use the mypy daemon, you may want to restart the daemon each time after the merge base or local branch has changed to avoid processing a potentially large number of changes in an incremental build, as this can be much slower than downloading cache data and restarting the daemon.
  • If the current local branch is based on a very recent master commit, the remote cache data may not yet be available for that commit, as there will necessarily be some latency to build the cache files. It may be a good idea to look for cache data for, say, the 5 latest master commits and use the most recent data that is available.
  • If the remote cache is not accessible for some reason (say, from a public network), the script can still fall back to a normal incremental build.
  • You can have multiple local cache directories for different local branches using the --cache-dir option. If the user switches to an existing branch where downloaded cache data is already available, you can continue to use the existing cache data instead of redownloading the data.
  • You can set up your CI build to use a remote cache to speed up the CI build. This would be particularly useful if each CI build starts from a fresh state without access to cache files from previous builds. It’s still recommended to run a full, non-incremental mypy build to create the cache data, as repeatedly updating cache data incrementally could result in drift over a long time period (due to a mypy caching issue, perhaps).
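
Two of these refinements, reusing the local cache when the merge base is unchanged and falling back to older commits when the newest cache is not yet available, can be sketched in Python as follows. The marker file that records the last downloaded commit and the has_cache callback are hypothetical. Walking back from the merge base with git rev-list is one possible reading of "the 5 latest master commits".

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def recent_commits(repo_dir: Path, start: str, count: int = 5) -> List[str]:
    """Return `start` and its most recent ancestors, newest first."""
    result = subprocess.run(
        ["git", "rev-list", f"--max-count={count}", start],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def newest_cached_commit(
    candidates: Iterable[str], has_cache: Callable[[str], bool]
) -> Optional[str]:
    """Return the first candidate with remote cache data, or None."""
    for commit_id in candidates:
        if has_cache(commit_id):
            return commit_id
    return None  # caller falls back to a normal incremental build


def needs_download(commit_id: str, marker: Path) -> bool:
    """False if the marker file says this commit's cache is already in place."""
    return not (marker.exists() and marker.read_text().strip() == commit_id)


def record_download(commit_id: str, marker: Path) -> None:
    """Remember which commit the local cache data was downloaded for."""
    marker.write_text(commit_id + "\n")
```

A wrapper could pass recent_commits(repo_dir, merge_base_id) to newest_cached_commit(), download only when needs_download() returns True, and call record_download() afterwards.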