CephFS Snapshots
CephFS supports snapshots, generally created by invoking mkdir within the .snap
directory. Note this is a hidden, special directory, not visible during a directory listing.
Overview
Generally, snapshots do what they sound like: they create an immutable view of the file system at the point in time they’re taken. There are some headline features that make CephFS snapshots different from what you might expect:
Arbitrary subtrees. Snapshots are created within any directory you choose, and cover all data in the file system under that directory.
Asynchronous. If you create a snapshot, buffered data is flushed out lazily, including from other clients. As a result, “creating” the snapshot is very fast.
Important Data Structures
SnapRealm: A SnapRealm is created whenever you create a snapshot at a new point in the hierarchy (or when a snapshotted inode is moved outside of its parent snapshot). SnapRealms contain an sr_t srnode, and the inodes_with_caps that are part of the snapshot. Clients also have a SnapRealm concept that maintains less data but is used to associate a SnapContext with each open file for writing.
sr_t: An sr_t is the on-disk snapshot metadata. It is part of the containing directory and contains sequence counters, timestamps, the list of associated snapshot IDs, and past_parent_snaps.
SnapServer: SnapServer manages snapshot ID allocation and snapshot deletion, and tracks the list of effective snapshots in the file system. A file system has only one instance of snapserver.
SnapClient: SnapClient is used to communicate with the snapserver; each MDS rank has its own snapclient instance. SnapClient also caches effective snapshots locally.
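To make the relationship between SnapRealm and sr_t concrete, here is a heavily simplified C++ sketch; it mirrors only the fields described above and is not the actual definition from the Ceph source tree:

  #include <cstdint>
  #include <ctime>
  #include <set>
  #include <vector>

  using snapid_t = uint64_t;
  struct CInode;  // stand-in for the MDS in-memory inode type

  // Simplified view of the on-disk snapshot metadata (sr_t).
  struct sr_t_sketch {
    snapid_t seq = 0;                      // bumped every time a snapshot is taken
    snapid_t created = 0;                  // snapid at which this realm came into being
    std::time_t last_modified = 0;         // timestamp of the most recent change
    std::set<snapid_t> snaps;              // snapshot IDs rooted at this realm
    std::set<snapid_t> past_parent_snaps;  // snapshots inherited from former parent realms
  };

  // Simplified view of an in-memory SnapRealm.
  struct SnapRealm_sketch {
    sr_t_sketch srnode;                    // embedded on-disk metadata
    std::vector<CInode*> inodes_with_caps; // inodes with client caps covered by this realm
  };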
Creating a snapshot
The CephFS snapshot feature is enabled by default on new file systems. To enable it on existing file systems, use the command below.

  $ ceph fs set <fs_name> allow_new_snaps true
When snapshots are enabled, all directories in CephFS will have a special .snap
directory. (You may configure a different name with the client snapdir
setting if you wish.)
To create a CephFS snapshot, create a subdirectory under .snap
with a name of your choice. For example, to create a snapshot on directory “/1/2/3/”, invoke mkdir /1/2/3/.snap/my-snapshot-name.
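The same operation can also be done programmatically through libcephfs; the directory path, snapshot name, and use of the default ceph.conf below are purely illustrative:

  #include <cephfs/libcephfs.h>
  #include <cstdio>

  int main() {
    struct ceph_mount_info *cmount = nullptr;
    if (ceph_create(&cmount, nullptr) < 0)   // default client name
      return 1;
    ceph_conf_read_file(cmount, nullptr);    // read the default ceph.conf
    if (ceph_mount(cmount, "/") < 0)         // mount the file system root
      return 1;

    // Creating a snapshot is simply an mkdir inside the special ".snap" directory.
    int r = ceph_mkdir(cmount, "/1/2/3/.snap/my-snapshot-name", 0755);
    if (r < 0)
      std::fprintf(stderr, "mksnap failed: %d\n", r);  // negative errno on failure

    ceph_unmount(cmount);
    ceph_release(cmount);
    return r < 0 ? 1 : 0;
  }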
This is transmitted to the MDS Server as a CEPH_MDS_OP_MKSNAP-tagged MClientRequest, and initially handled in Server::handle_client_mksnap(). It allocates a snapid from the SnapServer, projects a new inode with the new SnapRealm, and commits it to the MDLog as usual. When committed, it invokes MDCache::do_realm_invalidate_and_update_notify(), which notifies all clients with caps on files under “/1/2/3/” about the new SnapRealm. When clients get the notifications, they update the client-side SnapRealm hierarchy, link files under “/1/2/3/” to the new SnapRealm, and generate a SnapContext for the new SnapRealm.
Note that this is not a synchronous part of the snapshot creation!
Updating a snapshot
If you delete a snapshot, a similar process is followed. If you remove an inode out of its parent SnapRealm, the rename code creates a new SnapRealm for the renamed inode (if a SnapRealm does not already exist), saves the IDs of snapshots that are effective on the original parent SnapRealm into the past_parent_snaps of the new SnapRealm, and then follows a process similar to creating a snapshot.
Generating a SnapContext
A RADOS SnapContext consists of a snapshot sequence ID (snapid) and all the snapshot IDs that an object is already part of. To generate that list, we combine the snapids associated with the SnapRealm and all valid snapids in past_parent_snaps. Stale snapids are filtered out by SnapClient’s cached effective snapshots.
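A rough sketch of that combination step, assuming we already have the realm's own snapids, its past_parent_snaps, and the set of snapshots the SnapClient still reports as effective (all names here are illustrative, not the MDS implementation):

  #include <algorithm>
  #include <cstdint>
  #include <set>
  #include <vector>

  using snapid_t = uint64_t;

  struct SnapContextSketch {
    snapid_t seq;                 // snapshot sequence ID
    std::vector<snapid_t> snaps;  // snapids the object is part of, newest first
  };

  // Combine the realm's own snapids with the still-valid past-parent snapids,
  // dropping anything the SnapClient no longer considers effective.
  SnapContextSketch make_snap_context(snapid_t seq,
                                      const std::set<snapid_t>& realm_snaps,
                                      const std::set<snapid_t>& past_parent_snaps,
                                      const std::set<snapid_t>& effective_snaps) {
    SnapContextSketch ctx{seq, {}};
    for (snapid_t s : realm_snaps)
      ctx.snaps.push_back(s);
    for (snapid_t s : past_parent_snaps)
      if (effective_snaps.count(s))          // filter out stale snapids
        ctx.snaps.push_back(s);
    std::sort(ctx.snaps.rbegin(), ctx.snaps.rend());  // RADOS wants descending order
    return ctx;
  }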
Storing snapshot data
File data is stored in RADOS “self-managed” snapshots. Clients are careful to use the correct SnapContext when writing file data to the OSDs.
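At the RADOS level this uses the librados self-managed snapshot interface; a hedged sketch follows, in which the pool name, object name, and payload are made up, and the snapid allocation is shown only to illustrate the mechanism (in CephFS the SnapServer hands out snapids):

  #include <rados/librados.hpp>
  #include <cstdint>
  #include <vector>

  // Write object data under a self-managed snapshot, given an already
  // connected librados::Rados handle.
  int write_with_snap_context(librados::Rados& cluster) {
    librados::IoCtx ioctx;
    int r = cluster.ioctx_create("cephfs_data", ioctx);  // hypothetical data pool
    if (r < 0)
      return r;

    uint64_t snapid = 0;
    r = ioctx.selfmanaged_snap_create(&snapid);          // allocate a new snapid
    if (r < 0)
      return r;

    // Install the SnapContext for subsequent writes: the sequence number plus
    // the snapids this object already belongs to, newest first.
    std::vector<librados::snap_t> snaps = {snapid};
    r = ioctx.selfmanaged_snap_set_write_ctx(snapid, snaps);
    if (r < 0)
      return r;

    librados::bufferlist bl;
    bl.append("file data");
    return ioctx.write_full("some_object", bl);          // written under that SnapContext
  }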
Storing snapshot metadata
Snapshotted dentries (and their inodes) are stored in-line as part of the directory they were in at the time of the snapshot. All dentries include a first and last snapid for which they are valid. (Non-snapshotted dentries will have their last set to CEPH_NOSNAP.)
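Put differently, each dentry carries a [first, last] snapid validity interval; a minimal sketch of that check, with NOSNAP standing in for the real CEPH_NOSNAP constant:

  #include <cstdint>

  using snapid_t = uint64_t;

  // Stand-in for CEPH_NOSNAP, which marks the live ("head") version of a dentry.
  constexpr snapid_t NOSNAP = static_cast<snapid_t>(-2);

  struct DentrySketch {
    snapid_t first;  // first snapid for which this dentry is valid
    snapid_t last;   // last snapid for which it is valid (NOSNAP if still live)
  };

  // A dentry version is visible in a given snapshot if the snapid falls in [first, last].
  bool visible_in_snapshot(const DentrySketch& dn, snapid_t snap) {
    return dn.first <= snap && snap <= dn.last;
  }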
Snapshot writeback
There is a great deal of code to handle writeback efficiently. When a Client receives an MClientSnap message, it updates the local SnapRealm representation and its links to specific Inodes, and generates a CapSnap for the Inode. The CapSnap is flushed out as part of capability writeback, and if there is dirty data the CapSnap is used to block fresh data writes until the snapshot is completely flushed to the OSDs.
In the MDS, we generate snapshot-representing dentries as part of the regular process for flushing them. Dentries with outstanding CapSnap data are kept pinned and in the journal.
Deleting snapshots
Snapshots are deleted by invoking “rmdir” on the “.snap” directory they are rooted in. (Attempts to delete a directory which roots snapshots will fail; you must delete the snapshots first.) Once deleted, they are entered into the OSDMap list of deleted snapshots and the file data is removed by the OSDs. Metadata is cleaned up as the directory objects are read in and written back out again.
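Continuing the hypothetical libcephfs sketch from the snapshot-creation example above, deletion is an ordinary rmdir of the snapshot's directory:

  #include <cephfs/libcephfs.h>

  // Remove the snapshot created earlier; returns 0 or a negative errno.
  // Note that rmdir on "/1/2/3" itself would fail while snapshots still exist there.
  int remove_snapshot(struct ceph_mount_info *cmount) {
    return ceph_rmdir(cmount, "/1/2/3/.snap/my-snapshot-name");
  }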
Hard links
An inode with multiple hard links is moved to a dummy global SnapRealm. The dummy SnapRealm covers all snapshots in the file system. The inode’s data will be preserved for any new snapshot. The preserved data will cover snapshots on any linkage of the inode.
Multi-FS
Snapshots and multiple file systems don’t interact well. Specifically, each MDS cluster allocates snapids independently; if you have multiple file systems sharing a single pool (via namespaces), their snapshots will collide, and deleting one will result in missing file data for others. (This may even be invisible, not throwing errors to the user.) If each FS gets its own pool, things probably work, but this isn’t tested and may not be true.