Contributing to Ceph: A Guide for Developers

  • Author: Loic Dachary

  • Author: Nathan Cutler

  • License: Creative Commons Attribution Share Alike 3.0 (CC-BY-SA-3.0)

Note

You may also be interested in the Ceph Internals documentation.

Introduction

This guide has two aims. First, it should lower the barrier to entry for software developers who wish to get involved in the Ceph project. Second, it should serve as a reference for Ceph developers.

We assume that readers are already familiar with Ceph (the distributed object store and file system designed to provide excellent performance, reliability and scalability). If not, please refer to the project website and especially the publications list. Another way to learn about what’s happening in Ceph is to check out our YouTube channel, where we post Tech Talks, code walk-throughs and Ceph Developer Monthly recordings.

Since this document is to be consumed by developers, who are assumed to have Internet access, topics covered elsewhere, either within the Ceph documentation or elsewhere on the web, are treated by linking. If you notice that a link is broken or if you know of a better link, please report it as a bug.

Essentials (tl;dr)

This chapter presents essential information that every Ceph developer needs to know.

Leads

The Ceph project is led by Sage Weil. In addition, each major project component has its own lead. The following table shows all the leads and their nicks on GitHub:

Scope       Lead               GitHub nick
Ceph        Sage Weil          liewegas
RADOS       Neha Ojha          neha-ojha
RGW         Yehuda Sadeh       yehudasa
RGW         Matt Benjamin      mattbenjamin
RBD         Jason Dillaman     dillaman
CephFS      Patrick Donnelly   batrick
Dashboard   Lenz Grimmer       LenzGr
MON         Joao Luis          jecluis
Build/Ops   Ken Dreyer         ktdreyer

The Ceph-specific acronyms in the table are explained in Architecture.

History

See the History chapter of the Wikipedia article.

Licensing

Ceph is free software.

Unless stated otherwise, the Ceph source code is distributed under the terms of the LGPL2.1 or LGPL3.0. For full details, see the file COPYING in the top-level directory of the source-code tree.

Source code repositories

The source code of Ceph lives on GitHub in a number of repositories below the Ceph “organization”.

To make a meaningful contribution to the project as a developer, a working knowledge of git is essential.

Although the Ceph “organization” includes several software repositories, this document covers only one: https://github.com/ceph/ceph.

Redmine issue tracker

Although GitHub is used for code, Ceph-related issues (Bugs, Features, Backports, Documentation, etc.) are tracked at http://tracker.ceph.com, which is powered by Redmine.

The tracker has a Ceph project with a number of subprojects loosely corresponding to the various architectural components (see Architecture).

Mere registration in the tracker automatically grants permissions sufficient to open new issues and comment on existing ones.

To report a bug or propose a new feature, jump to the Ceph project and click on New issue.

Mailing list

Ceph development email discussions take place on the mailing list ceph-devel@vger.kernel.org. The list is open to all. Subscribe by sending a message to majordomo@vger.kernel.org with the line:

    subscribe ceph-devel

in the body of the message.

There are also other Ceph-related mailing lists.

IRC

In addition to mailing lists, the Ceph community also communicates in real time using Internet Relay Chat.

See https://ceph.com/irc/ for how to set up your IRC client and a list of channels.

Submitting patches

The canonical instructions for submitting patches are contained in the file CONTRIBUTING.rst in the top-level directory of the source-code tree. There may be some overlap between this guide and that file.

All newcomers are encouraged to read that file carefully.

Building from source

See instructions at Build Ceph.

Using ccache to speed up local builds

Rebuilds of the Ceph source tree can benefit significantly from the use of ccache. When switching branches, one often sees build failures on certain older branches, mostly due to stale build artifacts; the rebuilds needed after cleaning those out are where ccache helps most. To get a fully clean source tree, one could do:

    $ make clean

    # note the following will nuke everything in the source tree that
    # isn't tracked by git, so make sure to back up any log files / conf options

    $ git clean -fdx; git submodule foreach git clean -fdx

ccache is available as a package in most distros. To build Ceph with ccache, one can:

    $ cmake -DWITH_CCACHE=ON ..

ccache can also be used to speed up all builds on the system. For more details, refer to the run modes section of the ccache manual. The default settings of ccache can be displayed with ccache -s.

Note

It is recommended to override max_size, the size of the cache (10G by default), with a larger value such as 25G. Refer to the configuration section of the ccache manual.

To further increase the cache hit rate and reduce compile times in a development environment, it is possible to set version information and build timestamps to fixed values, which avoids frequent rebuilds of binaries that contain this information.

This can be achieved by adding the following settings to the ccache configuration file ccache.conf:

    sloppiness = time_macros
    run_second_cpp = true

Now, set the environment variable SOURCE_DATE_EPOCH to a fixed value (a UNIX timestamp) and set ENABLE_GIT_VERSION to OFF when running cmake:

    $ export SOURCE_DATE_EPOCH=946684800
    $ cmake -DWITH_CCACHE=ON -DENABLE_GIT_VERSION=OFF ..

Note

Binaries produced with these build options are not suitable for production or debugging purposes, as they do not contain the correct build time and git version information.
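To make the fixed timestamp less opaque, the following minimal Python sketch converts the SOURCE_DATE_EPOCH value used above into a readable UTC date and back. The helper names epoch_to_utc and utc_to_epoch are illustrative only and are not part of Ceph or ccache.

```python
from datetime import datetime, timezone

# Value taken from the export command shown above.
SOURCE_DATE_EPOCH = 946684800


def epoch_to_utc(epoch: int) -> str:
    """Render a UNIX timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).isoformat()


def utc_to_epoch(year: int, month: int, day: int) -> int:
    """Return the UNIX timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


if __name__ == "__main__":
    print(epoch_to_utc(SOURCE_DATE_EPOCH))
```

Any other fixed value would serve equally well, provided it stays the same from one build to the next.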

Development-mode cluster

See Developer Guide (Quick).

Kubernetes/Rook development cluster

See Hacking on Ceph in Kubernetes with Rook

Backporting

All bugfixes should be merged to the master branch before being backported. To flag a bugfix for backporting, make sure it has a tracker issue associated with it and set the Backport field to a comma-separated list of previous releases (e.g. “hammer,jewel”) that you think need the backport. The rest (including the actual backporting) will be taken care of by the Stable Releases and Backports team.

Guidance for use of cluster log

If your patches emit messages to the Ceph cluster log, please consult this guidance: Use of the cluster log.

What is merged where and when ?

Commits are merged into branches according to criteria that change during the lifecycle of a Ceph release. This chapter is the inventory of what can be merged in which branch at a given point in time.

Development releases (i.e. x.0.z)

What ?

  • features

  • bug fixes

Where ?

Features are merged to the master branch. Bug fixes should be merged to the corresponding named branch (e.g. “jewel” for 10.0.z, “kraken” for 11.0.z, etc.). However, this is not mandatory - bug fixes can be merged to the master branch as well, since the master branch is periodically merged to the named branch during the development releases phase. In either case, if the bugfix is important it can also be flagged for backport to one or more previous stable releases.

When ?

After the stable release candidates of the previous release enter phase 2 (see below). For example: the “jewel” named branch was created when the infernalis release candidates entered phase 2. From this point on, master was no longer associated with infernalis. As soon as the named branch of the next stable release is created, master starts getting periodically merged into it.

Branch merges

  • The branch of the stable release is merged periodically into master.

  • The master branch is merged periodically into the branch of the stable release.

  • The master is merged into the branch of the stable release immediately after each development x.0.z release.

Stable release candidates (i.e. x.1.z) phase 1

What ?

  • bug fixes only

Where ?

The branch of the stable release (e.g. “jewel” for 10.0.z, “kraken” for 11.0.z, etc.) or master. Bug fixes should be merged to the named branch corresponding to the stable release candidate (e.g. “jewel” for 10.1.z) or to master. During this phase, all commits to master will be merged to the named branch, and vice versa. In other words, it makes no difference whether a commit is merged to the named branch or to master - it will make it into the next release candidate either way.

When ?

After the first stable release candidate is published, i.e. after the x.1.0 tag is set in the release branch.

Branch merges

  • The branch of the stable release is merged periodically into master.

  • The master branch is merged periodically into the branch of the stable release.

  • The master is merged into the branch of the stable release immediately after each x.1.z release candidate.

Stable release candidates (i.e. x.1.z) phase 2

What ?

  • bug fixes only

Where ?

The branch of the stable release (e.g. “jewel” for 10.0.z, “kraken” for 11.0.z, etc.). During this phase, all commits to the named branch will be merged into master. Cherry-picking to the named branch during release candidate phase 2 is done manually since the official backporting process only begins when the release is pronounced “stable”.

When ?

After Sage Weil decides it is time for phase 2 to happen.

Branch merges

  • The branch of the stable release is merged periodically into master.

Stable releases (i.e. x.2.z)

What ?

  • bug fixes

  • features are sometimes accepted

  • commits should be cherry-picked from master when possible

  • commits that are not cherry-picked from master must be about a bug unique to the stable release

  • see also the backport HOWTO

Where ?

The branch of the stable release (hammer for 0.94.x, infernalis for 9.2.x, etc.)

When ?

After the stable release is published, i.e. after the “vx.2.0” tag is set in the release branch.

Branch merges

Never
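The x.0.z / x.1.z / x.2.z convention used throughout this chapter can be summarized in a few lines of Python. This is only an illustrative sketch: the function name release_phase is hypothetical, and the sketch deliberately rejects older version schemes such as hammer's 0.94.x, which do not follow the pattern.

```python
import re

# Phase names keyed by the middle version component, as described in
# the section headings of this chapter.
PHASES = {
    0: "development release",
    1: "stable release candidate",
    2: "stable release",
}


def release_phase(version: str) -> str:
    """Classify an x.y.z version string (optionally prefixed with 'v')."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"not an x.y.z version: {version!r}")
    minor = int(match.group(2))
    if minor not in PHASES:
        # Covers older schemes such as 0.94.x (hammer).
        raise ValueError(f"no phase defined for middle component {minor}")
    return PHASES[minor]
```

For example, release_phase("10.1.0") yields "stable release candidate", while release_phase("0.94.3") raises ValueError. The sketch cannot tell phase 1 from phase 2 of the release candidates, since that distinction is not encoded in the version number.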

Issue tracker

See Redmine issue tracker for a brief introduction to the Ceph Issue Tracker.

Ceph developers use the issue tracker to

  1. keep track of issues - bugs, fix requests, feature requests, backport requests, etc.

  2. communicate with other developers and keep them informed as work on the issues progresses.

Issue tracker conventions

When you start working on an existing issue, it’s nice to let the other developers know this - to avoid duplication of labor. Typically, this is done by changing the Assignee field (to yourself) and changing the Status to In progress. Newcomers to the Ceph community typically do not have sufficient privileges to update these fields, however: they can simply update the issue with a brief note.

Meanings of some commonly used statuses

Status             Meaning
New                Initial status
In Progress        Somebody is working on it
Need Review        Pull request is open with a fix
Pending Backport   Fix has been merged, backport(s) pending
Resolved           Fix and backports (if any) have been merged

Basic workflow

The following chart illustrates basic development workflow:

[Figure 1: basic development workflow chart]

Below we present an explanation of this chart. The explanation is written with the assumption that you, the reader, are a beginning developer who has an idea for a bugfix but does not know exactly how to proceed. Watch the Getting Started with Ceph Development video for a practical summary of the same.

Update the tracker

Before you start, you should know the Issue tracker number of the bug you intend to fix. If there is no tracker issue, now is the time to create one.

The tracker is there to explain the issue (bug) to your fellow Ceph developers and keep them informed as you make progress toward resolution. To this end, then, provide a descriptive title as well as sufficient information and details in the description.

If you have sufficient tracker permissions, assign the bug to yourself by changing the Assignee field. If your tracker permissions have not yet been elevated, simply add a comment to the issue with a short message like “I am working on this issue”.

Upstream code

This section, and the ones that follow, correspond to the nodes in the above chart.

The upstream code lives in https://github.com/ceph/ceph.git, which is sometimes referred to as the “upstream repo”, or simply “upstream”. As the chart illustrates, we will make a local copy of this code, modify it, test our modifications, and submit the modifications back to the upstream repo for review.

A local copy of the upstream code is made by

  • forking the upstream repo on GitHub, and

  • cloning your fork to make a local working copy

See the GitHub documentation for detailed instructions on forking. In short, if your GitHub username is “mygithubaccount”, your fork of the upstream repo will show up at https://github.com/mygithubaccount/ceph. Once you have created your fork, you clone it by doing:

    $ git clone https://github.com/mygithubaccount/ceph

While it is possible to clone the upstream repo directly, in this case you must fork it first. Forking is what enables us to open a GitHub pull request.

For more information on using GitHub, refer to GitHub Help.

Local environment

In the local environment created in the previous step, you now have a copy of the master branch in remotes/origin/master. Since the fork (https://github.com/mygithubaccount/ceph.git) is frozen in time and the upstream repo (https://github.com/ceph/ceph.git, typically abbreviated to ceph/ceph.git) is updated frequently by other developers, you will need to sync your fork periodically. To do this, first add the upstream repo as a “remote” and fetch it:

    $ git remote add ceph https://github.com/ceph/ceph.git
    $ git fetch ceph

Fetching downloads all objects (commits, branches) that were added since the last sync. After running these commands, all the branches from ceph/ceph.git are downloaded to the local git repo as remotes/ceph/$BRANCH_NAME and can be referenced as ceph/$BRANCH_NAME in certain git commands.

For example, your local master branch can be reset to the upstream Ceph master branch by doing:

    $ git fetch ceph
    $ git checkout master
    $ git reset --hard ceph/master

Finally, the master branch of your fork can then be synced to upstream master by:

    $ git push -u origin master

Bugfix branch

Next, create a branch for the bugfix:

    $ git checkout master
    $ git checkout -b fix_1
    $ git push -u origin fix_1

This creates a fix_1 branch locally and in our GitHub fork. At this point, the fix_1 branch is identical to the master branch, but not for long! You are now ready to modify the code.

Fix bug locally

At this point, change the status of the tracker issue to “In progress” to communicate to the other Ceph developers that you have begun working on a fix. If you don’t have permission to change that field, your comment that you are working on the issue is sufficient.

Possibly, your fix is very simple and requires only minimal testing. More likely, it will be an iterative process involving trial and error, not to mention skill. An explanation of how to fix bugs is beyond the scope of this document. Instead, we focus on the mechanics of the process in the context of the Ceph project.

For a detailed discussion of the tools available for validating your bugfixes, see the Testing chapters.

For now, let us just assume that you have finished work on the bugfix and that you have tested it and believe it works. Commit the changes to your local branch using the --signoff option:

    $ git commit -as

and push the changes to your fork:

    $ git push origin fix_1

GitHub pull request

The next step is to open a GitHub pull request. The purpose of this step is to make your bugfix available to the community of Ceph developers. They will review it and may do additional testing on it.

In short, this is the point where you “go public” with your modifications. Psychologically, you should be prepared to receive suggestions and constructive criticism. Don’t worry! In our experience, the Ceph project is a friendly place!

If you are uncertain how to use pull requests, you may read this GitHub pull request tutorial.

For some ideas on what constitutes a “good” pull request, see the Git Commit Good Practice article at the OpenStack Project Wiki.

Once your pull request (PR) is opened, update the Issue tracker by adding a comment to the bug pointing the other developers to your PR. The update can be as simple as:

    *PR*: https://github.com/ceph/ceph/pull/$NUMBER_OF_YOUR_PULL_REQUEST

Automated PR validation

When your PR hits GitHub, the Ceph project’s Continuous Integration (CI) infrastructure will test it automatically. At the time of this writing (March 2016), the automated CI testing included a test to check that the commits in the PR are properly signed (see Submitting patches) and a make check test.

The latter, make check, builds the PR and runs it through a battery of tests. These tests run on machines operated by the Ceph Continuous Integration (CI) team. When the tests complete, the result will be shown on GitHub in the pull request itself.

You can (and should) also test your modifications before you open a PR. Refer to the Testing chapters for details.

Notes on PR make check test

The GitHub make check test is driven by a Jenkins instance.

Jenkins merges the PR branch into the latest version of the base branch before starting the build, so you don’t have to rebase the PR to pick up any fixes.

You can trigger the PR tests at any time by adding a comment to the PR - the comment should contain the string “test this please”. Since a human subscribed to the PR might interpret that as a request for him or her to test the PR, it’s good to write the request as “Jenkins, test this please”.

The make check log is the place to go if there is a failure and you’re not sure what caused it. To reach it, first click on “details” (next to the make check test in the PR) to get into the Jenkins web GUI, and then click on “Console Output” (on the left).

Jenkins is set up to grep the log for strings known to have been associated with make check failures in the past. However, there is no guarantee that the strings are associated with any given make check failure. You have to dig into the log to be sure.

Integration tests AKA ceph-qa-suite

Since Ceph is a complex beast, it may also be necessary to test your fix to see how it behaves on real clusters running either on real or virtual hardware. Tests designed for this purpose live in the ceph/qa sub-directory and are run via the teuthology framework.

The Ceph community has access to the Sepia lab where integration tests can be run on real hardware. Other developers may add tags like “needs-qa” to your PR. This allows PRs that need testing to be merged into a single branch and tested all at the same time. Since teuthology suites can take hours (even days in some cases) to run, this can save a lot of time.

To request access to the Sepia lab, start here.

Integration testing is discussed in more detail in the integration testing chapter.

Code review

Once your bugfix has been thoroughly tested, or even during this process, it will be subjected to code review by other developers. This typically takes the form of correspondence in the PR itself, but can be supplemented by discussions on IRC and the Mailing list.

Amending your PR

While your PR is going through Testing and Code review, you can modify it at any time by editing files in your local branch.

After the changes are committed locally (to the fix_1 branch in our example), they need to be pushed to GitHub so they appear in the PR.

Modifying the PR is done by adding commits to the fix_1 branch upon which it is based, often followed by rebasing to modify the branch’s git history. See this tutorial for a good introduction to rebasing. When you are done with your modifications, you will need to force push your branch with:

    $ git push --force origin fix_1

Merge

The bugfixing process culminates when one of the project leads decides to merge your PR.

When this happens, it is a signal for you (or the lead who merged the PR) to change the Issue tracker status to “Resolved”. Some issues may be flagged for backporting, in which case the status should be changed to “Pending Backport” (see the Backporting chapter for details).

Testing - unit tests

Ceph has two types of tests: unit tests (also called make check tests) and integration tests. Strictly speaking, the make check tests are not “unit tests”, but rather tests that can be run easily on a single build machine after compiling Ceph from source, whereas integration tests require packages and multi-machine clusters to run.

What does “make check” mean?

After compiling Ceph, the code can be run through a battery of tests covering various aspects of Ceph. For historical reasons, this battery of tests is often referred to as make check even though the actual command used to run the tests is now ctest. For inclusion in this battery of tests, a test must:

  • bind ports that do not conflict with other tests

  • not require root access

  • not require more than one machine to run

  • complete within a few minutes

For simplicity, we will refer to this class of tests as “make check tests” or “unit tests”, to distinguish them from the more complex “integration tests” that are run via the teuthology framework.

While it is possible to run ctest directly, it can be tricky to correctly set up your environment. Fortunately, a script is provided to make it easier to run the unit tests on your code. It can be run from the top-level directory of the Ceph source tree by doing:

    $ ./run-make-check.sh

You will need a minimum of 8GB of RAM and 32GB of free disk space for this command to complete successfully on x86_64 (other architectures may have different constraints). Depending on your hardware, it can take from 20 minutes to three hours to complete, but it’s worth the wait.

How unit tests are declared

Unit tests are declared in the CMakeLists.txt files (multiple files under ./src) using the add_ceph_test or add_ceph_unittest CMake functions, which are themselves defined in ./cmake/modules/AddCephTest.cmake. Some unit tests are scripts, while others are binaries that are compiled during the build process. The add_ceph_test function is used to declare unit test scripts, while add_ceph_unittest is used for unit test binaries.

Unit testing of CLI tools

Some of the CLI tools are tested using special files ending with the extension .t and stored under ./src/test/cli. These tests are run using a tool called cram via a shell script ./src/test/run-cli-tests. cram tests that are not suitable for make check may also be run by teuthology using the cram task.

Tox based testing of python modules

Most python modules can be found under ./src/pybind/.

Many modules use tox to run their unit tests. tox itself is a generic virtualenv management and test command line tool.

To find out quickly whether tox can be run, you can either just try to run tox or check whether a tox.ini exists.

Currently the following modules use tox:

  • Ansible (./src/pybind/mgr/ansible)

  • Insights (./src/pybind/mgr/insights)

  • Orchestrator cli (./src/pybind/mgr/orchestrator_cli)

  • Manager core (./src/pybind/mgr)

  • Dashboard (./src/pybind/mgr/dashboard)

  • Python common (./src/python-common/tox.ini)

Most tox configurations support multiple environments and tasks. You can see which environments and tasks are supported by looking into the tox.ini file to see what envlist is assigned. To run tox, just execute tox in the directory where tox.ini lies. If no environments are specified with -e $env1,$env2, all environments will be run. Jenkins will run tox by executing run_tox.sh, which lies under ./src/script.
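As a rough illustration of "looking into the tox.ini file to see what envlist is assigned", the sketch below reads envlist with Python's standard configparser module. The sample file content and the function name read_envlist are assumptions made for this example, not copies of any Ceph file. The naive splitting also does not expand tox's generative syntax (for instance py{27,3}), so treat it as a quick inspection aid only.

```python
import configparser

# Hypothetical tox.ini content, loosely modelled on the environment
# names used in this guide; a real module's tox.ini will differ.
SAMPLE_TOX_INI = """
[tox]
envlist = py27,py3,lint,check
skipsdist = true
"""


def read_envlist(tox_ini_text: str) -> list:
    """Return the environment names listed under [tox] envlist."""
    # Interpolation is disabled because tox.ini files commonly contain
    # '%' and '{...}' sequences that configparser would misinterpret.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(tox_ini_text)
    raw = parser.get("tox", "envlist", fallback="")
    # Entries may be separated by commas and/or newlines.
    return [env.strip() for env in raw.replace("\n", ",").split(",") if env.strip()]


if __name__ == "__main__":
    print(read_envlist(SAMPLE_TOX_INI))
```

To inspect a real file, pass its contents to the function, for example read_envlist(open("tox.ini").read()).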

Here are some examples from the Ceph dashboard on how to specify different environments and run options:

    ## Run Python 2+3 tests+lint commands:
    $ tox -e py27,py3,lint,check

    ## Run Python 3 tests+lint commands:
    $ tox -e py3,lint,check

    ## To run it like Jenkins would do
    $ ../../../script/run_tox.sh --tox-env py27,py3,lint,check
    $ ../../../script/run_tox.sh --tox-env py3,lint,check

Manager core unit tests

Currently only doctests inside mgr_util.py are run.

To add more files that should be tested inside the core of the manager, add them at the end of the line that includes mgr_util.py inside tox.ini.

Unit test caveats

  • Unlike the various Ceph daemons and ceph-fuse, the unit tests are linked against the default memory allocator (glibc) unless explicitly linked against something else. This enables tools like valgrind to be used in the tests.

Testing - Integration Tests

Ceph has two types of tests: make check tests and integration tests. When a test requires multiple machines, root access or lasts for a longer time (for example, to simulate a realistic Ceph deployment), it is deemed to be an integration test. Integration tests are organized into “suites”, which are defined in the ceph/qa sub-directory and run with the teuthology-suite command.

The teuthology-suite command is part of the teuthology framework. In the sections that follow we attempt to provide a detailed introduction to that framework from the perspective of a beginning Ceph developer.

Teuthology consumes packages

It may take some time to understand the significance of this fact, but it is very significant. It means that automated tests can be conducted on multiple platforms using the same packages (RPM, DEB) that can be installed on any machine running those platforms.

Teuthology has a list of platforms that it supports (as of December 2017 the list consisted of “CentOS 7.2” and “Ubuntu 16.04”). It expects to be provided pre-built Ceph packages for these platforms. Teuthology deploys these platforms on machines (bare-metal or cloud-provisioned), installs the packages on them, and deploys Ceph clusters on them - all as called for by the test.

The Nightlies

A number of integration tests are run on a regular basis in the Sepia lab against the official Ceph repositories (on the master development branch and the stable branches). Traditionally, these tests are called “the nightlies” because the Ceph core developers used to live and work in the same time zone and from their perspective the tests were run overnight.

The results of the nightlies are published at http://pulpito.ceph.com/. The developer nick shows in the test results URL and in the first column of the Pulpito dashboard. The results are also reported on the ceph-qa mailing list for analysis.

Testing Priority

The teuthology-suite command includes an almost mandatory option -p <N> which specifies the priority of the jobs submitted to the queue. The lower the value of N, the higher the priority. The option is almost mandatory because the default is 1000, which matches the priority of the nightlies. Nightlies are often half-finished and cancelled due to the volume of testing done, so your jobs may never finish. Therefore, it is common to select a priority less than 1000.

Any priority may be selected when submitting jobs. But, in order to be sensitive to the workings of other developers that also need to do testing, the following recommendations should be followed:

  • Priority < 10: Use this if the sky is falling and some group of tests must be run ASAP.

  • 10 <= Priority < 50: Use this if your tests are urgent and blocking other important development.

  • 50 <= Priority < 75: Use this if you are testing a particular feature/fix and running fewer than about 25 jobs. This range can also be used for urgent release testing.

  • 75 <= Priority < 100: Tech Leads will regularly schedule integration tests with this priority to verify pull requests against master.

  • 100 <= Priority < 150: This priority is to be used for QE validation of point releases.

  • 150 <= Priority < 200: Use this priority for 100 jobs or fewer of a particular feature/fix that you’d like results on in a day or so.

  • 200 <= Priority < 1000: Use this priority for large test runs that can be done over the course of a week.

In case you don’t know how many jobs would be triggered by the teuthology-suite command, use --dry-run to get a count first and then issue the teuthology-suite command again, this time without --dry-run and with -p and an appropriate number as an argument to it.
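The priority recommendations above amount to a lookup over half-open ranges. The following Python sketch encodes them that way. It is not part of teuthology: the function name describe_priority and the wording of the returned strings are illustrative, and the treatment of values of 1000 and above is an assumption based on the stated default.

```python
# (exclusive upper bound, guidance), transcribed from the
# recommendations listed above, in ascending order.
PRIORITY_BANDS = [
    (10, "the sky is falling: some group of tests must be run ASAP"),
    (50, "urgent tests that block other important development"),
    (75, "a particular feature/fix with fewer than about 25 jobs, or urgent release testing"),
    (100, "Tech Lead runs verifying pull requests against master"),
    (150, "QE validation of point releases"),
    (200, "100 jobs or fewer for a feature/fix, results wanted in a day or so"),
    (1000, "large test runs that can be done over the course of a week"),
]


def describe_priority(priority: int) -> str:
    """Return the recommended use for a teuthology-suite -p value."""
    for upper_bound, guidance in PRIORITY_BANDS:
        if priority < upper_bound:
            return guidance
    # Assumption: 1000 is the default and matches the nightlies, so
    # values at or above it compete with (or yield to) the nightlies.
    return "default or lower priority: queued alongside or behind the nightlies"
```

For instance, describe_priority(60) returns the guidance for the 50 to 75 band, which suits a small run for a single fix.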

Suites Inventory

The suites directory of the ceph/qa sub-directory contains all the integration tests, for all the Ceph components.

  • ceph-deploy: install a Ceph cluster with ceph-deploy (ceph-deploy man page)

  • dummy: get a machine, do nothing and return success (commonly used to verify the integration testing infrastructure works as expected)

  • fs: test CephFS mounted using FUSE

  • kcephfs: test CephFS mounted using kernel

  • krbd: test the RBD kernel module

  • multimds: test CephFS with multiple MDSs

  • powercycle: verify the Ceph cluster behaves when machines are powered off and on again

  • rados: run Ceph clusters including OSDs and MONs, under various conditions of stress

  • rbd: run RBD tests using actual Ceph clusters, with and without qemu

  • rgw: run RGW tests using actual Ceph clusters

  • smoke: run tests that exercise the Ceph API with an actual Ceph cluster

  • teuthology: verify that teuthology can run integration tests, with and without OpenStack

  • upgrade: for various versions of Ceph, verify that upgrades can happen without disrupting an ongoing workload

teuthology-describe-tests

In February 2016, a new feature called teuthology-describe-tests was added to the teuthology framework to facilitate documentation and better understanding of integration tests (feature announcement).

The upshot is that tests can be documented by embedding meta: annotations in the yaml files used to define the tests. The results can be seen in the ceph-qa-suite wiki.

Since this is a new feature, many yaml files have yet to be annotated. Developers are encouraged to improve the documentation, in terms of both coverage and quality.

How integration tests are run

Given that - as a new Ceph developer - you will typically not have access to the Sepia lab, you may rightly ask how you can run the integration tests in your own environment.

One option is to set up a teuthology cluster on bare metal. Though this is a non-trivial task, it is possible. Here are some notes to get you started if you decide to go this route.

If you have access to an OpenStack tenant, you have another option: the teuthology framework has an OpenStack backend, which is documented here. This OpenStack backend can build packages from a given git commit or branch, provision VMs, install the packages and run integration tests on those VMs. This process is controlled using a tool called ceph-workbench ceph-qa-suite. This tool also automates publishing of test results at http://teuthology-logs.public.ceph.com.

Running integration tests on your code contributions and publishing the results allows reviewers to verify that changes to the code base do not cause regressions, or to analyze test failures when they do occur.

Every teuthology cluster, whether bare-metal or cloud-provisioned, has a so-called “teuthology machine” from which test suites are triggered using the teuthology-suite command.

A detailed and up-to-date description of each teuthology-suite option is available by running the following command on the teuthology machine:

    $ teuthology-suite --help

How integration tests are defined

Integration tests are defined by yaml files found in the suitessubdirectory of the ceph/qa sub-directory and implemented by pythoncode found in the tasks subdirectory. Some tests (“standalone tests”)are defined in a single yaml file, while other tests are defined by adirectory tree containing yaml files that are combined, at runtime, into alarger yaml file.

Reading a standalone test

Let us first examine a standalone test, or “singleton”.

Here is a commented example using the integration test rados/singleton/all/admin-socket.yaml:

  roles:
  - - mon.a
    - osd.0
    - osd.1
  tasks:
  - install:
  - ceph:
  - admin_socket:
      osd.0:
        version:
        git_version:
        help:
        config show:
        config set filestore_dump_file /tmp/foo:
        perf dump:
        perf schema:

The roles array determines the composition of the cluster (how many MONs, OSDs, etc.) on which this test is designed to run, as well as how these roles will be distributed over the machines in the testing cluster. In this case, there is only one element in the top-level array: therefore, only one machine is allocated to the test. The nested array declares that this machine shall run a MON with id a (that is the mon.a in the list of roles) and two OSDs (osd.0 and osd.1).
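The reading of the roles array described above can be sketched in a few lines of Python. This is only an illustration: the roles are written as a Python literal instead of being loaded from the yaml file, and summarize_roles is a hypothetical helper, not part of teuthology.

```python
# Roles from rados/singleton/all/admin-socket.yaml, written as a Python
# literal so that no YAML library is needed for this illustration.
roles = [["mon.a", "osd.0", "osd.1"]]


def summarize_roles(roles):
    """Return one {daemon_type: [ids]} dict per machine (hypothetical helper)."""
    machines = []
    for machine_roles in roles:
        by_type = {}
        for role in machine_roles:
            # "osd.0" -> daemon type "osd", id "0"
            daemon_type, _, daemon_id = role.partition(".")
            by_type.setdefault(daemon_type, []).append(daemon_id)
        machines.append(by_type)
    return machines


summary = summarize_roles(roles)
print(len(summary), "machine(s)")
for index, by_type in enumerate(summary):
    print("machine", index, by_type)
```

Each element of the top-level list corresponds to one machine, so a suite that needs two machines would simply have two nested lists.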

The body of the test is in the tasks array: each element is evaluated in order, causing the corresponding python file found in the tasks subdirectory of the teuthology repository or ceph/qa sub-directory to be run. “Running” in this case means calling the task() function defined in that file.

In this case, the install task comes first. It installs the Ceph packages on each machine (as defined by the roles array). A full description of the install task is found in the python file (search for “def task”).

The ceph task, which is documented here (again, search for “def task”), starts OSDs and MONs (and possibly MDSs as well) as required by the roles array. In this example, it will start one MON (mon.a) and two OSDs (osd.0 and osd.1), all on the same machine. Control moves to the next task when the Ceph cluster reaches HEALTH_OK state.

The next task is admin_socket (source code). The parameter of the admin_socket task (and any other task) is a structure which is interpreted as documented in the task. In this example the parameter is a set of commands to be sent to the admin socket of osd.0. The task verifies that each of them returns with success (i.e. exit code zero).
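The in-order evaluation of the tasks array can be pictured with the following minimal sketch. It is not teuthology's implementation: the three functions are hypothetical stand-ins for the task() functions found in the real task files, the (ctx, config) signature is an assumption made for the example, and the tasks are written as a Python literal with a shortened admin_socket parameter.

```python
# Hypothetical stand-ins for the task() functions; the real ones live in
# separate Python files and do far more than record that they were called.
calls = []


def install_task(ctx, config):
    calls.append(("install", config))


def ceph_task(ctx, config):
    calls.append(("ceph", config))


def admin_socket_task(ctx, config):
    calls.append(("admin_socket", config))


# Maps a task name to the function that implements it (illustration only).
TASKS = {
    "install": install_task,
    "ceph": ceph_task,
    "admin_socket": admin_socket_task,
}

# The tasks array of the test, shortened and written as a Python literal.
tasks = [
    {"install": None},
    {"ceph": None},
    {"admin_socket": {"osd.0": {"version": None, "help": None}}},
]


def run_tasks(tasks, ctx=None):
    """Call each task in the order in which it appears in the array."""
    for entry in tasks:
        for name, config in entry.items():
            TASKS[name](ctx, config)


run_tasks(tasks)
print([name for name, _ in calls])
```

The value attached to each task name is passed through unchanged, which is why each task is free to interpret its own parameter structure.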

This test can be run with:

  $ teuthology-suite --suite rados/singleton/all/admin-socket.yaml fs/ext4.yaml

Test descriptions

Each test has a “test description”, which is similar to a directory path, but not the same. In the case of a standalone test, like the one in Reading a standalone test, the test description is identical to the relative path (starting from the suites/ directory of the ceph/qa sub-directory) of the yaml file defining the test.

Much more commonly, tests are defined not by a single yaml file, but by a directory tree of yaml files. At runtime, the tree is walked and all yaml files (facets) are combined into larger yaml “programs” that define the tests. A full listing of the yaml defining the test is included at the beginning of every test log.

In these cases, the description of each test consists of the subdirectory under suites/ containing the yaml facets, followed by an expression in curly braces ({}) consisting of a list of yaml facets in order of concatenation. For instance the test description:

  ceph-deploy/basic/{distros/centos_7.0.yaml tasks/ceph-deploy.yaml}

signifies the concatenation of two files:

  • ceph-deploy/basic/distros/centos_7.0.yaml

  • ceph-deploy/basic/tasks/ceph-deploy.yaml
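The mapping from a test description to the files it names can be sketched as follows. expand_description is a hypothetical helper written for this guide, not a teuthology function, and it assumes the simple one-level form of description shown above.

```python
def expand_description(description):
    """Split 'dir/{a.yaml b.yaml}' into ['dir/a.yaml', 'dir/b.yaml']."""
    if "{" not in description:
        # Standalone test: the description is already the relative path.
        return [description]
    prefix, _, rest = description.partition("{")
    facets = rest.rstrip("}").split()
    return [prefix + facet for facet in facets]


files = expand_description(
    "ceph-deploy/basic/{distros/centos_7.0.yaml tasks/ceph-deploy.yaml}"
)
for path in files:
    print(path)
```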

How tests are built from directories

As noted in the previous section, most tests are not defined in a single yaml file, but rather as a combination of files collected from a directory tree within the suites/ subdirectory of the ceph/qa sub-directory.

The set of all tests defined by a given subdirectory of suites/ is called an “integration test suite”, or a “teuthology suite”.

Combination of yaml facets is controlled by special files (% and +) that are placed within the directory tree and can be thought of as operators. The % file is the “convolution” operator and + signifies concatenation.

Convolution operator

The convolution operator, implemented as an empty file called %, tells teuthology to construct a test matrix from yaml facets found in subdirectories below the directory containing the operator.

For example, the ceph-deploy suite is defined by the suites/ceph-deploy/ tree, which consists of the files and subdirectories in the following structure:

  directory: ceph-deploy/basic
      file: %
      directory: distros
          file: centos_7.0.yaml
          file: ubuntu_16.04.yaml
      directory: tasks
          file: ceph-deploy.yaml

This is interpreted as a 2x1 matrix consisting of two tests:

  • ceph-deploy/basic/{distros/centos_7.0.yaml tasks/ceph-deploy.yaml}

  • ceph-deploy/basic/{distros/ubuntu_16.04.yaml tasks/ceph-deploy.yaml}

i.e. the concatenation of centos_7.0.yaml and ceph-deploy.yaml and the concatenation of ubuntu_16.04.yaml and ceph-deploy.yaml, respectively. In human terms, this means that the task found in ceph-deploy.yaml is intended to run on both CentOS 7.0 and Ubuntu 16.04.

Without the file percent, the ceph-deploy tree would be interpreted as three standalone tests:

  • ceph-deploy/basic/distros/centos_7.0.yaml

  • ceph-deploy/basic/distros/ubuntu_16.04.yaml

  • ceph-deploy/basic/tasks/ceph-deploy.yaml

(which would of course be wrong in this case).
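The effect of the % file can be illustrated with a Cartesian product over the subdirectories. The sketch below is not how teuthology is implemented; it represents the tree as a plain dictionary and uses a hypothetical convolve function to reproduce the two descriptions listed above.

```python
from itertools import product

# The ceph-deploy/basic tree from the text, as {subdirectory: [files]}.
tree = {
    "distros": ["centos_7.0.yaml", "ubuntu_16.04.yaml"],
    "tasks": ["ceph-deploy.yaml"],
}


def convolve(suite, tree):
    """Build one description per combination of one file from each subdirectory."""
    columns = [
        ["%s/%s" % (subdir, name) for name in sorted(files)]
        for subdir, files in sorted(tree.items())
    ]
    return ["%s/{%s}" % (suite, " ".join(combo)) for combo in product(*columns)]


descriptions = convolve("ceph-deploy/basic", tree)
for description in descriptions:
    print(description)
```

Two files in distros/ and one in tasks/ give 2 x 1 = 2 tests, which is the matrix described in the text.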

Referring to the ceph/qa sub-directory, you will notice that the centos_7.0.yaml and ubuntu_16.04.yaml files in the suites/ceph-deploy/basic/distros/ directory are implemented as symlinks. By using symlinks instead of copying, a single file can appear in multiple suites. This eases the maintenance of the test framework as a whole.

All the tests generated from the suites/ceph-deploy/ directory tree (also known as the “ceph-deploy suite”) can be run with:

  $ teuthology-suite --suite ceph-deploy

An individual test from the ceph-deploy suite can be run by adding the --filter option:

  $ teuthology-suite \
      --suite ceph-deploy/basic \
      --filter 'ceph-deploy/basic/{distros/ubuntu_16.04.yaml tasks/ceph-deploy.yaml}'

Note

To run a standalone test like the one in Reading a standalone test, --suite alone is sufficient. If you want to run a single test from a suite that is defined as a directory tree, --suite must be combined with --filter. This is because the --suite option understands POSIX relative paths only.

Concatenation operator

For even greater flexibility in sharing yaml files between suites, the special file plus (+) can be used to concatenate files within a directory. For instance, consider the suites/rbd/thrash tree:

  directory: rbd/thrash
      file: %
      directory: clusters
          file: +
          file: fixed-2.yaml
          file: openstack.yaml
      directory: workloads
          file: rbd_api_tests_copy_on_read.yaml
          file: rbd_api_tests.yaml

This creates two tests:

  • rbd/thrash/{clusters/fixed-2.yaml clusters/openstack.yaml workloads/rbd_api_tests_copy_on_read.yaml}

  • rbd/thrash/{clusters/fixed-2.yaml clusters/openstack.yaml workloads/rbd_api_tests.yaml}

Because the clusters/ subdirectory contains the special file plus (+), all the other files in that subdirectory (fixed-2.yaml and openstack.yaml in this case) are concatenated together and treated as a single file. Without the special file plus, they would have been convolved with the files from the workloads directory to create a 2x2 matrix:

  • rbd/thrash/{clusters/openstack.yaml workloads/rbd_api_tests_copy_on_read.yaml}

  • rbd/thrash/{clusters/openstack.yaml workloads/rbd_api_tests.yaml}

  • rbd/thrash/{clusters/fixed-2.yaml workloads/rbd_api_tests_copy_on_read.yaml}

  • rbd/thrash/{clusters/fixed-2.yaml workloads/rbd_api_tests.yaml}
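The difference between the two cases can be sketched by treating a subdirectory that contains + as a single facet. As before, this is an illustration under simplifying assumptions: the tree is a plain dictionary, build is a hypothetical function, and the order of the generated descriptions is not meant to match teuthology's.

```python
from itertools import product


def build(suite, tree, concatenated=()):
    """Generate descriptions; subdirectories in `concatenated` hold a '+' file."""
    columns = []
    for subdir, files in sorted(tree.items()):
        paths = ["%s/%s" % (subdir, name) for name in sorted(files)]
        if subdir in concatenated:
            # '+' present: all files are joined and count as a single facet.
            columns.append([" ".join(paths)])
        else:
            # No '+': every file is a separate facet in the matrix.
            columns.append(paths)
    return ["%s/{%s}" % (suite, " ".join(combo)) for combo in product(*columns)]


tree = {
    "clusters": ["fixed-2.yaml", "openstack.yaml"],
    "workloads": ["rbd_api_tests_copy_on_read.yaml", "rbd_api_tests.yaml"],
}

with_plus = build("rbd/thrash", tree, concatenated={"clusters"})
without_plus = build("rbd/thrash", tree)
print(len(with_plus), "tests with +,", len(without_plus), "tests without +")
```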

The clusters/fixed-2.yaml file is shared among many suites to define the following roles:

  roles:
  - [mon.a, mon.c, osd.0, osd.1, osd.2, client.0]
  - [mon.b, osd.3, osd.4, osd.5, client.1]

The rbd/thrash suite as defined above, consisting of two tests, can be run with:

  $ teuthology-suite --suite rbd/thrash

A single test from the rbd/thrash suite can be run by adding the --filter option:

  $ teuthology-suite \
      --suite rbd/thrash \
      --filter 'rbd/thrash/{clusters/fixed-2.yaml clusters/openstack.yaml workloads/rbd_api_tests_copy_on_read.yaml}'

Filtering tests by their description

When a few jobs fail and need to be run again, the --filter option can be used to select tests with a matching description. For instance, if the rados suite fails the all/peer.yaml test, the following will only run the tests that contain this file:

  teuthology-suite --suite rados --filter all/peer.yaml

The --filter-out option does the opposite (it matches tests that do not contain a given string), and can be combined with the --filter option.

Both --filter and --filter-out take a comma-separated list of strings (which means the comma character is implicitly forbidden in filenames found in the ceph/qa sub-directory). For instance:

  teuthology-suite --suite rados --filter all/peer.yaml,all/rest-api.yaml

will run tests that contain either all/peer.yaml or all/rest-api.yaml.

Each string is looked up anywhere in the test description and has to be an exact match: they are not regular expressions.
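The matching rules can be sketched as plain substring tests. The select function and the three descriptions below are made up for the illustration, and the way the two options are combined (keep a test if it matches any --filter string and no --filter-out string) is an assumption rather than something stated above.

```python
def select(descriptions, filter_in="", filter_out=""):
    """Keep descriptions matching any filter_in string and no filter_out string."""
    wanted = [s for s in filter_in.split(",") if s]
    unwanted = [s for s in filter_out.split(",") if s]
    selected = []
    for description in descriptions:
        # Plain substring search: the strings are not regular expressions.
        if wanted and not any(s in description for s in wanted):
            continue
        if any(s in description for s in unwanted):
            continue
        selected.append(description)
    return selected


# Hypothetical test descriptions, for illustration only.
descriptions = [
    "rados/singleton/{all/peer.yaml fs/xfs.yaml}",
    "rados/singleton/{all/rest-api.yaml fs/xfs.yaml}",
    "rados/singleton/{all/admin-socket.yaml fs/btrfs.yaml}",
]

both = select(descriptions, filter_in="all/peer.yaml,all/rest-api.yaml")
no_btrfs = select(descriptions, filter_out="btrfs")
print(both)
print(no_btrfs)
```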

Reducing the number of tests

The rados suite generates tens or even hundreds of thousands of tests out of a few hundred files. This happens because teuthology constructs test matrices from subdirectories wherever it encounters a file named %. For instance, all tests in the rados/basic suite run with different messenger types: simple, async and random, because they are combined (via the special file %) with the msgr directory.
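The growth is multiplicative: the number of generated tests is the product of the number of facets in each convolved subdirectory. In the sketch below only the three messenger types come from the text; the other subdirectory names and counts are invented to show the arithmetic.

```python
from math import prod

# Only "msgr": 3 (simple, async, random) comes from the text above;
# the other names and counts are hypothetical.
facet_counts = {
    "msgr": 3,
    "clusters": 2,
    "objectstore": 4,
    "tasks": 10,
}

total = prod(facet_counts.values())
print(total)  # 3 * 2 * 4 * 10 = 240
```

With only 19 files this hypothetical suite already yields 240 tests, and every additional convolved subdirectory multiplies the total again.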

All integration tests are required to be run before a Ceph release is published. When merely verifying whether a contribution can be merged without risking a trivial regression, it is enough to run a subset. The --subset option can be used to reduce the number of tests that are triggered. For instance:

  teuthology-suite --suite rados --subset 0/4000

will run as few tests as possible. The tradeoff in this case is that not all combinations of test variations will be run together, but no matter how small a ratio is provided in the --subset option, teuthology will still ensure that all files in the suite are in at least one test. Understanding the actual logic that drives this requires reading the teuthology source code.

The --limit option only runs the first N tests in the suite: this is rarely useful, however, because there is no way to control which test will be first.

Testing in the cloud

In this chapter, we will explain in detail how to use an OpenStack tenant as an environment for Ceph integration testing.

Assumptions and caveat

We assume that:

  • you are the only person using the tenant

  • you have the credentials

  • the tenant supports the nova and cinder APIs

Caveat: be aware that, as of this writing (July 2016), testing in OpenStack clouds is a new feature. Things may not work as advertised. If you run into trouble, ask for help on IRC or the Mailing list, or open a bug report at the ceph-workbench bug tracker.

Prepare tenant

If you have not tried to use ceph-workbench with this tenant before, proceed to the next step.

To start with a clean slate, log in to your tenant via the Horizon dashboard and:

  • terminate the teuthology and packages-repository instances, if any

  • delete the teuthology and teuthology-worker security groups, if any

  • delete the teuthology and teuthology-myself key pairs, if any

Also do the above if you ever get key-related errors (“invalid key”, etc.) when trying to schedule suites.

Getting ceph-workbench

Since testing in the cloud is done using the ceph-workbench ceph-qa-suite tool, you will need to install that first. It is designed to be installed via Docker, so if you don’t have Docker running on your development machine, take care of that first. You can follow the official tutorial to install Docker if you have not installed it yet.

Once Docker is up and running, install ceph-workbench by following the Installation instructions in the ceph-workbench documentation.

Linking ceph-workbench with your OpenStack tenant

Before you can trigger your first teuthology suite, you will need to link ceph-workbench with your OpenStack account.

First, download an openrc.sh file by clicking on the “Download OpenStack RC File” button, which can be found in the “API Access” tab of the “Access & Security” dialog of the OpenStack Horizon dashboard.

Second, create a ~/.ceph-workbench directory, set its permissions to 700, and move the openrc.sh file into it. Make sure that the filename is exactly ~/.ceph-workbench/openrc.sh.

Third, edit the file so it does not ask for your OpenStack password interactively. Comment out the relevant lines and replace them with something like:

  export OS_PASSWORD="aiVeth0aejee3eep8rogho3eep7Pha6ek"

When ceph-workbench ceph-qa-suite connects to your OpenStack tenant for the first time, it will generate two keypairs: teuthology-myself and teuthology.

Run the dummy suite

You are now ready to take your OpenStack teuthology setup for a test drive:

  $ ceph-workbench ceph-qa-suite --suite dummy

Be forewarned that the first run of ceph-workbench ceph-qa-suite on a pristine tenant will take a long time to complete because it downloads a VM image, and during this time the command may not produce any output.

The images are cached in OpenStack, so they are only downloaded once. Subsequent runs of the same command will complete faster.

Although the dummy suite does not run any tests, in all other respects it behaves just like a teuthology suite and produces some of the same artifacts.

The last bit of output should look something like this:

  pulpito web interface: http://149.202.168.201:8081/
  ssh access : ssh -i /home/smithfarm/.ceph-workbench/teuthology-myself.pem ubuntu@149.202.168.201 # logs in /usr/share/nginx/html

What this means is that ceph-workbench ceph-qa-suite triggered the test suite run. It does not mean that the suite run has completed. To monitor progress of the run, check the Pulpito web interface URL periodically, or if you are impatient, ssh to the teuthology machine using the ssh command shown and do:

  $ tail -f /var/log/teuthology.*

The /usr/share/nginx/html directory contains the complete logs of the test suite. If we had provided the --upload option to the ceph-workbench ceph-qa-suite command, these logs would have been uploaded to http://teuthology-logs.public.ceph.com.

Run a standalone test

The standalone test explained in Reading a standalone test can be run with the following command:

  $ ceph-workbench ceph-qa-suite --suite rados/singleton/all/admin-socket.yaml

This will run the suite shown on the current master branch of ceph/ceph.git. You can specify a different branch with the --ceph option, and even a different git repo with the --ceph-git-url option. (Run ceph-workbench ceph-qa-suite --help for an up-to-date list of available options.)

The first run of a suite will also take a long time, because ceph packages have to be built first. Again, the packages so built are cached and ceph-workbench ceph-qa-suite will not build identical packages a second time.

Interrupt a running suite

Teuthology suites take time to run. From time to time one may wish to interrupt a running suite. One obvious way to do this is:

  ceph-workbench ceph-qa-suite --teardown

This destroys all VMs created by ceph-workbench ceph-qa-suite and returns the OpenStack tenant to a “clean slate”.

Sometimes you may wish to interrupt the running suite, but keep the logs, the teuthology VM, the packages-repository VM, etc. To do this, you can ssh to the teuthology VM (using the ssh access command reported when you triggered the suite; see Run the dummy suite) and, once there:

  sudo /etc/init.d/teuthology restart

This will keep the teuthology machine, the logs and the packages-repository instance but nuke everything else.

Upload logs to archive server

Since the teuthology instance in OpenStack is only semi-permanent, with limited space for storing logs, teuthology-openstack provides an --upload option which, if included in the ceph-workbench ceph-qa-suite command, will cause logs from all failed jobs to be uploaded to the log archive server maintained by the Ceph project. The logs will appear at the URL:

  http://teuthology-logs.public.ceph.com/$RUN

where $RUN is the name of the run. It will be a string like this:

  ubuntu-2016-07-23_16:08:12-rados-hammer-backports---basic-openstack

Even if you don’t provide the --upload option, however, all the logs can still be found on the teuthology machine in the directory /usr/share/nginx/html.

Provision VMs ad hoc

From the teuthology VM, it is possible to provision machines on an “ad hoc” basis, to use however you like. The magic incantation is:

  teuthology-lock --lock-many $NUMBER_OF_MACHINES \
      --os-type $OPERATING_SYSTEM \
      --os-version $OS_VERSION \
      --machine-type openstack \
      --owner $EMAIL_ADDRESS

The command must be issued from the ~/teuthology directory. The possible values for OPERATING_SYSTEM and OS_VERSION can be found by examining the contents of the directory teuthology/openstack/. For example:

  teuthology-lock --lock-many 1 --os-type ubuntu --os-version 16.04 \
      --machine-type openstack --owner foo@example.com

When you are finished with the machine, find it in the list of machines:

  openstack server list

to determine the name or ID, and then terminate it with:

  openstack server delete $NAME_OR_ID

Deploy a cluster for manual testing

The teuthology framework and ceph-workbench ceph-qa-suite are versatile tools that automatically provision Ceph clusters in the cloud and run various tests on them in an automated fashion. This enables a single engineer, in a matter of hours, to perform thousands of tests that would keep dozens of human testers occupied for days or weeks if conducted manually.

However, there are times when the automated tests do not cover a particular scenario and manual testing is desired. It turns out that it is simple to adapt a test to stop and wait after the Ceph installation phase, and the engineer can then ssh into the running cluster. Simply add the following snippet in the desired place within the test YAML and schedule a run with the test:

  tasks:
  - exec:
      client.0:
        - sleep 1000000000 # forever

(Make sure you have a client.0 defined in your roles stanza or adapt accordingly.)

The same effect can be achieved using the interactive task:

  tasks:
  - interactive

By following the test log, you can determine when the test cluster has entered the “sleep forever” condition. At that point, you can ssh to the teuthology machine and from there to one of the target VMs (OpenStack) or teuthology worker machines (Sepia) where the test cluster is running.

The VMs (or “instances” in OpenStack terminology) created by ceph-workbench ceph-qa-suite are named as follows:

teuthology - the teuthology machine

packages-repository - VM where packages are stored

ceph-* - VM where packages are built

target* - machines where tests are run

The VMs named target* are used by tests. If you are monitoring the teuthology log for a given test, the hostnames of these target machines can be found out by searching for the string Locked targets:

  2016-03-20T11:39:06.166 INFO:teuthology.task.internal:Locked targets:
    target149202171058.teuthology: null
    target149202171059.teuthology: null

The IP addresses of the target machines can be found by running openstack server list on the teuthology machine, but the target VM hostnames (e.g. target149202171058.teuthology) are resolvable within the teuthology cluster.
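Searching the log by hand works, but the hostnames can also be pulled out with a few lines of Python. The sketch below is a convenience written for this guide, not a teuthology tool; locked_targets is a hypothetical helper and it assumes that the log looks like the excerpt above, with one target per line directly after the Locked targets line.

```python
import re

# The log excerpt shown above, embedded so that the example is self-contained.
LOG = """\
2016-03-20T11:39:06.166 INFO:teuthology.task.internal:Locked targets:
  target149202171058.teuthology: null
  target149202171059.teuthology: null
"""


def locked_targets(log_text):
    """Return the hostnames listed directly after a 'Locked targets:' line."""
    targets = []
    collecting = False
    for line in log_text.splitlines():
        if line.rstrip().endswith("Locked targets:"):
            collecting = True
            continue
        if collecting:
            match = re.match(r"\s*(target\S+?):", line)
            if not match:
                break  # first line that is not a target ends the list
            targets.append(match.group(1))
    return targets


hosts = locked_targets(LOG)
print(hosts)
```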

Running tests from qa/ locally

How to run s3-tests locally

RGW code can be tested by building Ceph locally from source, starting a vstart cluster, and running the “s3-tests” suite against it.

The following instructions should work on jewel and above.

Step 1 - build Ceph

Refer to Build Ceph.

You can do step 2 separately while it is building.

Step 2 - vstart

When the build completes, and still in the top-level directory of the git clone where you built Ceph, do the following, for cmake builds:

  cd build/
  RGW=1 ../src/vstart.sh -n

This will produce a lot of output as the vstart cluster is started up. At the end you should see a message like:

  started. stop.sh to stop. see out/* (e.g. 'tail -f out/????') for debug output.

This means the cluster is running.

Step 3 - run s3-tests

To run the s3tests suite do the following:

  $ ../qa/workunits/rgw/run-s3tests.sh

Running test using vstart_runner.py

CephFS and Ceph Manager code can be tested using vstart_runner.py.

Running your first test

The Python tests in the Ceph repository can be executed on your local machine using vstart_runner.py. To do that, you’d need teuthology installed:

  $ git clone https://github.com/ceph/teuthology
  $ cd teuthology/
  $ virtualenv -p python2.7 ./venv
  $ source venv/bin/activate
  $ pip install --upgrade pip
  $ pip install -r requirements.txt
  $ python setup.py develop
  $ deactivate

Note

The pip command above is pip2, not pip3; run pip --version to check.

The above steps install teuthology in a virtual environment. Before running a test locally, build Ceph successfully from the source (refer to Build Ceph) and do:

  $ cd build
  $ ../src/vstart.sh -n -d -l
  $ source ~/path/to/teuthology/venv/bin/activate

To run a specific test, say test_reconnect_timeout from TestClientRecovery in qa/tasks/cephfs/test_client_recovery, you can do:

  $ python2 ../qa/tasks/vstart_runner.py tasks.cephfs.test_client_recovery.TestClientRecovery.test_reconnect_timeout

The above command runs vstart_runner.py and passes the test to be executed as an argument to vstart_runner.py. In a similar way, you can also run a group of tests in the following manner:

  $ # run all tests in class TestClientRecovery
  $ python2 ../qa/tasks/vstart_runner.py tasks.cephfs.test_client_recovery.TestClientRecovery
  $ # run all tests in test_client_recovery.py
  $ python2 ../qa/tasks/vstart_runner.py tasks.cephfs.test_client_recovery

Based on the argument passed, vstart_runner.py collects the tests and executes them as it would execute a single test.
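The three levels of granularity (method, class, module) match the way dotted names are resolved by the standard unittest loader. The following sketch shows that resolution on a throwaway module; the module name demo_tests and the test classes are invented for the example, it uses current Python rather than python2, and it makes no claim about how vstart_runner.py collects tests internally.

```python
import sys
import types
import unittest


class TestRecoveryDemo(unittest.TestCase):
    def test_reconnect(self):
        self.assertTrue(True)

    def test_timeout(self):
        self.assertTrue(True)


class TestOtherDemo(unittest.TestCase):
    def test_other(self):
        self.assertTrue(True)


# Register a throwaway module (hypothetical name) so that dotted names
# can be resolved by import, as they would be for a real test file.
demo = types.ModuleType("demo_tests")
demo.TestRecoveryDemo = TestRecoveryDemo
demo.TestOtherDemo = TestOtherDemo
sys.modules["demo_tests"] = demo

loader = unittest.TestLoader()
one_method = loader.loadTestsFromName("demo_tests.TestRecoveryDemo.test_reconnect")
one_class = loader.loadTestsFromName("demo_tests.TestRecoveryDemo")
whole_module = loader.loadTestsFromName("demo_tests")

counts = (
    one_method.countTestCases(),
    one_class.countTestCases(),
    whole_module.countTestCases(),
)
print(counts)
```

The shorter the dotted name, the more tests are collected: one method, then every method of the class, then every test class in the module.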

Note

vstart_runner.py as well as most tests in qa/ are only compatible with python2. Therefore, use python2 to run the tests locally.

vstart_runner.py can take the following options:

  • --clear-old-log: deletes the old log file before running the test

  • --create: creates the Ceph cluster before running a test

  • --create-cluster-only: creates the cluster and quits; tests can be issued later

  • --interactive: drops into a Python shell when a test fails

  • --log-ps-output: logs ps output; might be useful while debugging

  • --teardown: tears the Ceph cluster down after the test(s) have finished running

  • --kclient: uses the kernel cephfs client instead of FUSE

Note

If using the FUSE client, ensure that the fuse package is installed and enabled on the system and that user_allow_other is added to /etc/fuse.conf.

Note

If using the kernel client, the user must have the ability to run commands with passwordless sudo access. A failure on the kernel client may crash the host, so it’s recommended to use this functionality within a virtual machine.

Internal working of vstart_runner.py

vstart_runner.py primarily does three things:

  • collects and runs the tests

    vstart_runner.py sets up and tears down the cluster, and collects and runs the tests. This is implemented using the methods scan_tests(), load_tests() and exec_test(). This is where all the options that vstart_runner.py takes are implemented, along with other features like logging and copying the traceback to the bottom of the log.

  • provides an interface for issuing and testing shell commands

    The tests are written assuming that the cluster exists on remote machines. vstart_runner.py provides an interface to run the same tests with a cluster that exists on the local machine. This is done using the class LocalRemote. Class LocalRemoteProcess can manage the process that executes the commands from LocalRemote, class LocalDaemon provides an interface to handle Ceph daemons and class LocalFuseMount can create and handle FUSE mounts.

  • provides an interface to operate a Ceph cluster

    LocalCephManager provides methods to run Ceph cluster commands with and without the admin socket, and LocalCephCluster provides methods to set or clear ceph.conf.