Synchronization primitives in the Linux kernel. Part 5.

Introduction

This is the fifth part of the chapter which describes synchronization primitives in the Linux kernel. In the previous parts we finished considering different types of spinlocks, as well as the semaphore and mutex synchronization primitives. In this part we will continue to learn about synchronization primitives and start to consider a special type of synchronization primitive: the readers–writer lock.

The first synchronization primitive of this type will already be familiar to us: the semaphore. As in all previous parts of this book, before we consider the implementation of reader/writer semaphores in the Linux kernel, we will start from the theoretical side and try to understand the difference between reader/writer semaphores and normal semaphores.

So, let’s start.

Reader/Writer semaphore

There are two types of operations that may be performed on data: we may read data and we may change data. These are the two fundamental operations, read and write. Usually (but not always), the read operation is performed more often than the write operation. In this case, it is logical to lock data in such a way that several processes may read the locked data at the same time, on condition that no one changes the data. The readers/writer lock allows us to take such a lock.

When a process wants to write something into the data, all other writer and reader processes are blocked until the process which acquired the lock releases it. When a process reads data, other processes which want to read the same data are not blocked and are able to do so. As you may guess, the implementation of the reader/writer semaphore is based on the implementation of the normal semaphore. We are already familiar with the semaphore synchronization primitive from the third part of this chapter. From the theoretical side everything looks pretty simple. Let’s look at how the reader/writer semaphore is represented in the Linux kernel.

The normal semaphore is represented by the following structure:

    struct semaphore {
        raw_spinlock_t lock;
        unsigned int count;
        struct list_head wait_list;
    };

If you look in the include/linux/rwsem.h header file, you will find the definition of the rw_semaphore structure, which represents a reader/writer semaphore in the Linux kernel. Let’s look at the definition of this structure:

    #ifdef CONFIG_RWSEM_GENERIC_SPINLOCK
    #include <linux/rwsem-spinlock.h>
    #else
    struct rw_semaphore {
        long count;
        struct list_head wait_list;
        raw_spinlock_t wait_lock;
    #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
        struct optimistic_spin_queue osq;
        struct task_struct *owner;
    #endif
    #ifdef CONFIG_DEBUG_LOCK_ALLOC
        struct lockdep_map dep_map;
    #endif
    };

Before we consider the fields of the rw_semaphore structure, we may notice that the declaration of the rw_semaphore structure depends on the CONFIG_RWSEM_GENERIC_SPINLOCK kernel configuration option. This option is disabled for the x86_64 architecture by default. We can make sure of this by looking at the corresponding kernel configuration file. In our case, this configuration file is arch/x86/um/Kconfig:

    config RWSEM_XCHGADD_ALGORITHM
        def_bool 64BIT

    config RWSEM_GENERIC_SPINLOCK
        def_bool !RWSEM_XCHGADD_ALGORITHM

So, as this book describes only x86_64 architecture related stuff, we will skip the case when the CONFIG_RWSEM_GENERIC_SPINLOCK kernel configuration option is enabled and consider only the definition of the rw_semaphore structure from the include/linux/rwsem.h header file.

If we take a look at the definition of the rw_semaphore structure, we will notice that the first three fields are the same as in the semaphore structure. It contains the count field, which represents the amount of available resources, the wait_list field, which represents a doubly linked list of processes waiting to acquire the lock, and the wait_lock spinlock, which protects this list. Notice that the rw_semaphore.count field has the long type, unlike the same field in the semaphore structure.

The count field of a rw_semaphore structure may have the following values:

  • 0x0000000000000000 - the reader/writer semaphore is in the unlocked state and no one is waiting for the lock;
  • 0x000000000000000X - X readers are active or attempting to acquire the lock and no writer is waiting;
  • 0xffffffff0000000X - may represent different cases. The first: X readers are active or attempting to acquire the lock, and there are waiters for the lock. The second: one writer is attempting to acquire the lock and there are no waiters for the lock. The last: one writer is active and there are no waiters for the lock;
  • 0xffffffff00000001 - may represent two different cases. The first: one reader is active or attempting to acquire the lock and there are waiters for the lock. The second: one writer is active or attempting to acquire the lock and there are no waiters for the lock;
  • 0xffffffff00000000 - represents the situation when readers or writers are queued, but no one is active or in the process of acquiring the lock;
  • 0xfffffffe00000001 - a writer is active or attempting to acquire the lock and waiters are in the queue.
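The values in the list above follow from simple arithmetic on two halves of the count. The following is a minimal userspace sketch of that arithmetic, not kernel code. The TOY_* names and helper functions are hypothetical, and the mask value 0xffffffff is an assumption chosen to match the 64-bit values listed above:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's RWSEM_* constants on a 64-bit system. */
#define TOY_ACTIVE_MASK       INT64_C(0xffffffff)                  /* assumed mask */
#define TOY_ACTIVE_BIAS       INT64_C(1)                           /* one active locker */
#define TOY_WAITING_BIAS      (-TOY_ACTIVE_MASK - 1)               /* 0xffffffff00000000 */
#define TOY_ACTIVE_WRITE_BIAS (TOY_WAITING_BIAS + TOY_ACTIVE_BIAS) /* 0xffffffff00000001 */

/* The low 32 bits count the active (or attempting) lockers. */
static inline uint32_t toy_active(int64_t count)
{
    return (uint32_t)(count & TOY_ACTIVE_MASK);
}

/* A negative count means that a writer is involved or that tasks are queued. */
static inline bool toy_contended(int64_t count)
{
    return count < 0;
}
```

For example, one active writer plus queued waiters is TOY_ACTIVE_WRITE_BIAS + TOY_WAITING_BIAS, which gives the last value from the list, 0xfffffffe00000001.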

So, apart from the count field, these fields are similar to the fields of the semaphore structure. The last three fields depend on two configuration options of the Linux kernel: CONFIG_RWSEM_SPIN_ON_OWNER and CONFIG_DEBUG_LOCK_ALLOC. The first two of them may be familiar to us from the declaration of the mutex structure in the previous part. The osq field represents the MCS lock spinner for optimistic spinning, and the owner field represents the process which is the current owner of the lock.

The last field of the rw_semaphore structure, dep_map, is debugging related, and as I already wrote in previous parts, we will skip debugging related stuff in this chapter.

That’s all. Now we know a little about what a reader/writer lock is in general and what a reader/writer semaphore is in particular. Additionally, we saw how a reader/writer semaphore is represented in the Linux kernel. So we may go ahead and start to look at the API which the Linux kernel provides for manipulating reader/writer semaphores.

Reader/Writer semaphore API

So, we know a little about reader/writer semaphores from the theoretical side; let’s look at their implementation in the Linux kernel. All of the reader/writer semaphore related API is located in the include/linux/rwsem.h header file.

As always, before we consider the API of the reader/writer semaphore mechanism in the Linux kernel, we need to know how to initialize the rw_semaphore structure. As we already saw in previous parts of this chapter, all synchronization primitives may be initialized in two ways:

  • statically;
  • dynamically.

The reader/writer semaphore is not an exception. First of all, let’s take a look at the first approach. We may initialize a rw_semaphore structure at compile time with the help of the DECLARE_RWSEM macro. This macro is defined in the include/linux/rwsem.h header file and looks like this:

    #define DECLARE_RWSEM(name) \
        struct rw_semaphore name = __RWSEM_INITIALIZER(name)

As we may see, the DECLARE_RWSEM macro just expands to the definition of a rw_semaphore structure with the given name. Additionally, the new rw_semaphore structure is initialized with the value of the __RWSEM_INITIALIZER macro:

    #define __RWSEM_INITIALIZER(name)                            \
    {                                                            \
        .count = RWSEM_UNLOCKED_VALUE,                           \
        .wait_list = LIST_HEAD_INIT((name).wait_list),           \
        .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock)    \
        __RWSEM_OPT_INIT(name)                                   \
        __RWSEM_DEP_MAP_INIT(name)                               \
    }

This macro expands to the initialization of the fields of the rw_semaphore structure. First of all we initialize the count field of the rw_semaphore structure to the unlocked state with the RWSEM_UNLOCKED_VALUE macro from the arch/x86/include/asm/rwsem.h architecture specific header file:

    #define RWSEM_UNLOCKED_VALUE 0x00000000L

After this we initialize the list of lock waiters as an empty linked list, and the spinlock which protects this list in the unlocked state too. The __RWSEM_OPT_INIT macro depends on the state of the CONFIG_RWSEM_SPIN_ON_OWNER kernel configuration option; if this option is enabled, it expands to the initialization of the osq and owner fields of the rw_semaphore structure. The CONFIG_RWSEM_SPIN_ON_OWNER kernel configuration option is enabled by default for the x86_64 architecture, so let’s take a look at the definition of the __RWSEM_OPT_INIT macro:

    #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
    #define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL
    #else
    #define __RWSEM_OPT_INIT(lockname)
    #endif

As we may see, the __RWSEM_OPT_INIT macro initializes the MCS lock to the unlocked state and the initial owner of the lock to NULL. From this moment, the rw_semaphore structure is initialized at compile time and may be used for data protection.
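The way an optional macro with a leading comma extends a designated initializer can be tried out in userspace. The following is only a sketch of the technique. The toy_rwsem structure, the TOY_* macros and the TOY_SPIN_ON_OWNER switch are hypothetical stand-ins for the kernel definitions, and plain int fields replace the real lock types:

```c
#include <stddef.h>

#define TOY_SPIN_ON_OWNER 1   /* stand-in for CONFIG_RWSEM_SPIN_ON_OWNER */

struct toy_rwsem {
    long count;
    int wait_lock;            /* stand-in for raw_spinlock_t */
#if TOY_SPIN_ON_OWNER
    int osq;                  /* stand-in for struct optimistic_spin_queue */
    void *owner;              /* stand-in for struct task_struct * */
#endif
};

#if TOY_SPIN_ON_OWNER
/* The leading comma appends fields after the last mandatory initializer. */
#define TOY_OPT_INIT(name) , .osq = 0, .owner = NULL
#else
/* Expands to nothing, so the initializer simply ends earlier. */
#define TOY_OPT_INIT(name)
#endif

#define TOY_RWSEM_INITIALIZER(name) \
    {                               \
        .count = 0L,                \
        .wait_lock = 0              \
        TOY_OPT_INIT(name)          \
    }

#define DECLARE_TOY_RWSEM(name) \
    struct toy_rwsem name = TOY_RWSEM_INITIALIZER(name)

/* Defines and statically initializes a global structure named toy_sem. */
DECLARE_TOY_RWSEM(toy_sem);
```

Because the last mandatory field has no trailing comma, the initializer stays valid whether the optional macro expands to extra fields or to nothing.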

The second way to initialize a rw_semaphore structure is to do it dynamically, with the init_rwsem macro from the include/linux/rwsem.h header file. This macro declares an instance of lock_class_key, which is related to the lock validator of the Linux kernel, and calls the __init_rwsem function with the given reader/writer semaphore:

    #define init_rwsem(sem)                         \
    do {                                            \
        static struct lock_class_key __key;         \
                                                    \
        __init_rwsem((sem), #sem, &__key);          \
    } while (0)
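Two details of this macro are easy to miss: each expansion gets its own static key, and #sem turns the argument into a string exactly as it was written. A small userspace sketch, assuming hypothetical toy_* names in place of the kernel types, shows both effects:

```c
struct toy_key { int unused; };   /* stand-in for struct lock_class_key */

struct toy_rwsem {
    long count;
    const char *name;             /* the stringified macro argument */
    struct toy_key *key;          /* the per-call-site key */
};

static void toy_init_rwsem_fn(struct toy_rwsem *sem, const char *name,
                              struct toy_key *key)
{
    sem->count = 0;
    sem->name = name;
    sem->key = key;
}

/* do { ... } while (0) makes the macro behave like a single statement. */
#define toy_init_rwsem(sem)                              \
do {                                                     \
    static struct toy_key toy_site_key;                  \
    toy_init_rwsem_fn((sem), #sem, &toy_site_key);       \
} while (0)
```

With this sketch, toy_init_rwsem(&a) stores the string "&a" as the name, and two different call sites end up with two different key addresses.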

If you look for the definition of the __init_rwsem function, you will notice that there are a couple of source code files which contain it. As you may guess, sometimes we need to initialize additional fields of the rw_semaphore structure, like osq and owner, and sometimes not. All of this depends on kernel configuration options. If we look at the kernel/locking/Makefile makefile, we will see the following lines:

    obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
    obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o

As we already know, the Linux kernel for the x86_64 architecture has the CONFIG_RWSEM_XCHGADD_ALGORITHM kernel configuration option enabled by default in the arch/x86/um/Kconfig kernel configuration file:

    config RWSEM_XCHGADD_ALGORITHM
        def_bool 64BIT

In this case, the implementation of the __init_rwsem function is located in the kernel/locking/rwsem-xadd.c source code file. Let’s take a look at this function:

    void __init_rwsem(struct rw_semaphore *sem, const char *name,
                      struct lock_class_key *key)
    {
    #ifdef CONFIG_DEBUG_LOCK_ALLOC
        debug_check_no_locks_freed((void *)sem, sizeof(*sem));
        lockdep_init_map(&sem->dep_map, name, key, 0);
    #endif
        sem->count = RWSEM_UNLOCKED_VALUE;
        raw_spin_lock_init(&sem->wait_lock);
        INIT_LIST_HEAD(&sem->wait_list);
    #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
        sem->owner = NULL;
        osq_lock_init(&sem->osq);
    #endif
    }

We see here almost the same as in the __RWSEM_INITIALIZER macro, with the difference that all of this is executed at runtime.

So, now that we are able to initialize a reader/writer semaphore, let’s look at the lock and unlock API. The Linux kernel provides the following primary API to manipulate reader/writer semaphores:

  • void down_read(struct rw_semaphore *sem) - lock for reading;
  • int down_read_trylock(struct rw_semaphore *sem) - try lock for reading;
  • void down_write(struct rw_semaphore *sem) - lock for writing;
  • int down_write_trylock(struct rw_semaphore *sem) - try lock for writing;
  • void up_read(struct rw_semaphore *sem) - release a read lock;
  • void up_write(struct rw_semaphore *sem) - release a write lock;

Let’s start, as always, with locking. First of all let’s consider the implementation of the down_write function, which tries to acquire a lock for writing. This function is defined in the kernel/locking/rwsem.c source code file and starts with a call to the might_sleep macro from the include/linux/kernel.h header file:

    void __sched down_write(struct rw_semaphore *sem)
    {
        might_sleep();
        rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
        LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
        rwsem_set_owner(sem);
    }

We already met the might_sleep macro in the previous part. In short, the implementation of the might_sleep macro depends on the CONFIG_DEBUG_ATOMIC_SLEEP kernel configuration option; if this option is enabled, the macro just prints a stack trace when it is executed in atomic context. As this macro is mostly for debugging purposes, we will skip it and go ahead. Additionally, we will skip the next macro from the down_write function, rwsem_acquire, which is related to the lock validator of the Linux kernel, because this is a topic for another part.

The only two things that remain in the down_write function are the call of the LOCK_CONTENDED macro, which is defined in the include/linux/lockdep.h header file, and the setting of the lock owner with the rwsem_set_owner function, which sets the owner to the currently running process:

    static inline void rwsem_set_owner(struct rw_semaphore *sem)
    {
        sem->owner = current;
    }

As you may already guess, the LOCK_CONTENDED macro does all the work for us. Let’s look at the implementation of the LOCK_CONTENDED macro:

    #define LOCK_CONTENDED(_lock, try, lock) \
        lock(_lock)

As we may see, it just calls the lock function, which is the third parameter of the LOCK_CONTENDED macro, with the given rw_semaphore. In our case the third parameter of the LOCK_CONTENDED macro is the __down_write function, which is architecture specific and located in the arch/x86/include/asm/rwsem.h header file. Let’s look at the implementation of the __down_write function:

    static inline void __down_write(struct rw_semaphore *sem)
    {
        __down_write_nested(sem, 0);
    }

This function just calls the __down_write_nested function from the same source code file. Let’s take a look at the implementation of the __down_write_nested function:

    static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
    {
        long tmp;
        asm volatile("# beginning down_write\n\t"
                     LOCK_PREFIX " xadd %1,(%2)\n\t"
                     " test " __ASM_SEL(%w1,%k1) "," __ASM_SEL(%w1,%k1) "\n\t"
                     " jz 1f\n"
                     " call call_rwsem_down_write_failed\n"
                     "1:\n"
                     "# ending down_write"
                     : "+m" (sem->count), "=d" (tmp)
                     : "a" (sem), "1" (RWSEM_ACTIVE_WRITE_BIAS)
                     : "memory", "cc");
    }

As with other synchronization primitives which we saw in this chapter, lock/unlock functions usually consist only of an inline assembly statement. As we may see, the same is true for the __down_write_nested function. Let’s try to understand what this function does. The first line of our assembly statement is just a comment, so let’s skip it. The second line contains LOCK_PREFIX, which expands to the LOCK instruction prefix, as we already know. The following xadd instruction executes exchange and add operations. In other words, the xadd instruction adds the value of RWSEM_ACTIVE_WRITE_BIAS:

    #define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
    #define RWSEM_WAITING_BIAS (-RWSEM_ACTIVE_MASK-1)
    #define RWSEM_ACTIVE_BIAS 0x00000001L

or 0xffffffff00000001, to the count of the given reader/writer semaphore and returns its previous value. After this we check the active mask in the previous value of rw_semaphore->count. If it was zero, this means that there was no active reader or writer before, so we acquired the lock. Otherwise we call the call_rwsem_down_write_failed function from the arch/x86/lib/rwsem.S assembly file. The call_rwsem_down_write_failed function just calls the rwsem_down_write_failed function from the kernel/locking/rwsem-xadd.c source code file, saving the general purpose registers beforehand:

    ENTRY(call_rwsem_down_write_failed)
        FRAME_BEGIN
        save_common_regs
        movq %rax,%rdi
        call rwsem_down_write_failed
        restore_common_regs
        FRAME_END
        ret
    ENDPROC(call_rwsem_down_write_failed)
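Going back to the fast path of __down_write_nested, its logic can be modelled in userspace with C11 atomics. This is only a sketch under assumptions: the toy_* and TOY_* names are hypothetical, the mask value 0xffffffff is assumed from the 64-bit count values shown earlier, and atomic_fetch_add stands in for the lock xadd instruction:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's RWSEM_* constants. */
#define TOY_ACTIVE_MASK       INT64_C(0xffffffff)
#define TOY_ACTIVE_BIAS       INT64_C(1)
#define TOY_WAITING_BIAS      (-TOY_ACTIVE_MASK - 1)
#define TOY_ACTIVE_WRITE_BIAS (TOY_WAITING_BIAS + TOY_ACTIVE_BIAS)

struct toy_rwsem {
    _Atomic int64_t count;
};

/*
 * Returns true when the fast path took the write lock. On false the real
 * kernel would enter the slow path (call_rwsem_down_write_failed); this
 * sketch just reports the failure and leaves the bias in the count.
 */
static bool toy_down_write_fast(struct toy_rwsem *sem)
{
    /* Like "lock xadd": add the bias and fetch the value stored before. */
    int64_t old = atomic_fetch_add(&sem->count, TOY_ACTIVE_WRITE_BIAS);

    /* Like "test %k1,%k1; jz 1f": was the active part zero before? */
    return (old & TOY_ACTIVE_MASK) == 0;
}
```

On an unlocked semaphore the first call succeeds and leaves 0xffffffff00000001 in the count, while a second call sees a non-zero active part and fails.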

The rwsem_down_write_failed function starts with an atomic update of the count value by -RWSEM_ACTIVE_WRITE_BIAS:

    __visible
    struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
    {
        count = rwsem_atomic_update(-RWSEM_ACTIVE_WRITE_BIAS, sem);
        ...
        ...
        ...
    }

The rwsem_atomic_update function is defined in the arch/x86/include/asm/rwsem.h header file and implements the exchange and add logic:

    static inline long rwsem_atomic_update(long delta, struct rw_semaphore *sem)
    {
        return delta + xadd(&sem->count, delta);
    }

This function atomically adds the given delta to the count. The xadd operation itself returns the old value of the count, so the function returns the sum of the given delta and the old value, in other words the new value of the count field. In our case we undo the write bias in the count, as we didn’t acquire the lock. After this step we try optimistic spinning by calling the rwsem_optimistic_spin function:

    if (rwsem_optimistic_spin(sem))
        return sem;
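Before moving on to optimistic spinning, the "add, then return the new value" behaviour of rwsem_atomic_update can be sketched in userspace. As before, the toy_* and TOY_* names are hypothetical, the constants are assumed to match the 64-bit values used in this part, and atomic_fetch_add plays the role of xadd:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's RWSEM_* constants. */
#define TOY_ACTIVE_MASK       INT64_C(0xffffffff)
#define TOY_ACTIVE_BIAS       INT64_C(1)
#define TOY_WAITING_BIAS      (-TOY_ACTIVE_MASK - 1)
#define TOY_ACTIVE_WRITE_BIAS (TOY_WAITING_BIAS + TOY_ACTIVE_BIAS)

struct toy_rwsem {
    _Atomic int64_t count;
};

/*
 * atomic_fetch_add returns the old value, so adding delta to it once more
 * yields the value that is stored in the count after the update.
 */
static int64_t toy_atomic_update(int64_t delta, struct toy_rwsem *sem)
{
    return delta + atomic_fetch_add(&sem->count, delta);
}
```

If one reader is active (count is 1) and a writer adds its bias, the count becomes 0xffffffff00000002. Undoing the bias with toy_atomic_update(-TOY_ACTIVE_WRITE_BIAS, sem) brings the count back to 1, which is what the slow path does first.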

We will skip the implementation of the rwsem_optimistic_spin function, as it is similar to the mutex_optimistic_spin function which we saw in the previous part. In short, the rwsem_optimistic_spin function checks that there are no other tasks ready to run that have higher priority. If there are no such tasks, the process is added to the MCS wait queue and spins in a loop until the lock can be acquired. If optimistic spinning fails or is disabled, the process is marked as waiting for write and added to the waiters list:

    waiter.task = current;
    waiter.type = RWSEM_WAITING_FOR_WRITE;
    if (list_empty(&sem->wait_list))
        waiting = false;
    list_add_tail(&waiter.list, &sem->wait_list);

There it starts to wait until it successfully acquires the lock. If the waiters list was empty before we added the process to it, we update the value of rw_semaphore->count with RWSEM_WAITING_BIAS:

    count = rwsem_atomic_update(RWSEM_WAITING_BIAS, sem);

With this we mark in rw_semaphore->count that the semaphore is already locked and that there is one writer waiting to acquire the lock. Otherwise, if other processes were queued before this writer and there are no active writers, we try to wake the reader processes at the front of the wait queue. At the end of the rwsem_down_write_failed function, a writer process which didn’t acquire the lock goes to sleep in the following loop:

    while (true) {
        if (rwsem_try_write_lock(count, sem))
            break;
        raw_spin_unlock_irq(&sem->wait_lock);
        do {
            schedule();
            set_current_state(TASK_UNINTERRUPTIBLE);
        } while ((count = sem->count) & RWSEM_ACTIVE_MASK);
        raw_spin_lock_irq(&sem->wait_lock);
    }

I will skip the explanation of this loop, as we already met similar functionality in the previous part.

That’s all. From this moment, whether our writer process acquires the lock or not depends on the value of the rw_semaphore->count field. Now if we look at the implementation of the down_read function, which tries to acquire the lock for reading, we will see actions similar to those we saw in the down_write function. This function calls different debugging and lock validator related functions/macros:

    void __sched down_read(struct rw_semaphore *sem)
    {
        might_sleep();
        rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
        LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
    }

It does all the work in the __down_read function. The __down_read function consists of an inline assembly statement:

    static inline void __down_read(struct rw_semaphore *sem)
    {
        asm volatile("# beginning down_read\n\t"
                     LOCK_PREFIX _ASM_INC "(%1)\n\t"
                     " jns 1f\n"
                     " call call_rwsem_down_read_failed\n"
                     "1:\n\t"
                     "# ending down_read\n\t"
                     : "+m" (sem->count)
                     : "a" (sem)
                     : "memory", "cc");
    }

This statement increments the value of the given rw_semaphore->count and calls call_rwsem_down_read_failed if the resulting value is negative. Otherwise we jump to the label 1: and exit, and the read lock is successfully acquired. Notice that we check the sign of the count value because it may be negative: as you may remember, the most significant word of rw_semaphore->count contains the negated number of active writers.
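The reader fast path is even shorter than the writer one and can be modelled the same way. This is a userspace sketch with hypothetical toy_* and TOY_* names and assumed constant values. atomic_fetch_add plus one stands in for the locked increment, and the sign test stands in for jns:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's RWSEM_* constants. */
#define TOY_ACTIVE_MASK       INT64_C(0xffffffff)
#define TOY_ACTIVE_BIAS       INT64_C(1)
#define TOY_WAITING_BIAS      (-TOY_ACTIVE_MASK - 1)
#define TOY_ACTIVE_WRITE_BIAS (TOY_WAITING_BIAS + TOY_ACTIVE_BIAS)

struct toy_rwsem {
    _Atomic int64_t count;
};

/*
 * Returns true when the read lock was taken on the fast path. On false the
 * kernel would call call_rwsem_down_read_failed; this sketch only reports it.
 */
static bool toy_down_read_fast(struct toy_rwsem *sem)
{
    /* Like "lock inc": bump the count and look at the resulting value. */
    int64_t now = atomic_fetch_add(&sem->count, TOY_ACTIVE_BIAS) + TOY_ACTIVE_BIAS;

    /* Like "jns 1f": a non-negative result means no writer and no waiters. */
    return now >= 0;
}
```

Several readers may succeed one after another on an unlocked semaphore, but when a writer holds the lock the result stays negative and the reader has to take the slow path.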

Let’s consider the case when a process wants to acquire a lock for a read operation, but it is already locked. In this case the call_rwsem_down_read_failed function from the arch/x86/lib/rwsem.S assembly file will be called. If you look at the implementation of this function, you will notice that it does the same as the call_rwsem_down_write_failed function, except that it calls the rwsem_down_read_failed function instead of rwsem_down_write_failed. Now let’s consider the implementation of the rwsem_down_read_failed function. It starts by adding the process to the wait queue and updating the value of rw_semaphore->count:

    long adjustment = -RWSEM_ACTIVE_READ_BIAS;
    waiter.task = tsk;
    waiter.type = RWSEM_WAITING_FOR_READ;
    if (list_empty(&sem->wait_list))
        adjustment += RWSEM_WAITING_BIAS;
    list_add_tail(&waiter.list, &sem->wait_list);
    count = rwsem_atomic_update(adjustment, sem);

Notice that if the wait queue was empty before, we add RWSEM_WAITING_BIAS to the adjustment in addition to undoing the read bias; otherwise we only undo the read bias. At the next step, if there are no active locks, or if we are first in the wait queue and there are no writers, we wake the queued processes in order to join the currently active reader processes. Otherwise we go to sleep until the lock can be acquired.
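The effect of this adjustment on the count is easy to check with plain arithmetic. The following userspace sketch uses hypothetical toy_* and TOY_* names with constant values assumed from the 64-bit count values shown earlier; the wait list itself is reduced to a boolean flag:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's RWSEM_* constants. */
#define TOY_ACTIVE_MASK       INT64_C(0xffffffff)
#define TOY_ACTIVE_BIAS       INT64_C(1)
#define TOY_ACTIVE_READ_BIAS  TOY_ACTIVE_BIAS
#define TOY_WAITING_BIAS      (-TOY_ACTIVE_MASK - 1)
#define TOY_ACTIVE_WRITE_BIAS (TOY_WAITING_BIAS + TOY_ACTIVE_BIAS)

/* Computes the value that the failed reader adds to the count. */
static int64_t toy_read_failed_adjustment(bool wait_list_empty)
{
    /* Always undo the increment made by the failed fast path. */
    int64_t adjustment = -TOY_ACTIVE_READ_BIAS;

    /* The first waiter also records that the queue is no longer empty. */
    if (wait_list_empty)
        adjustment += TOY_WAITING_BIAS;

    return adjustment;
}
```

For example, with one active writer and one failed reader increment the count is 0xffffffff00000002. If that reader is the first waiter, applying the adjustment gives 0xfffffffe00000001, the "writer is active and waiters are in the queue" value.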

That’s all. Now we know how reader and writer processes behave in different cases during lock acquisition. Now let’s take a short look at the unlock operations. The up_read and up_write functions allow us to unlock a reader or writer lock. First of all let’s take a look at the implementation of the up_write function, which is defined in the kernel/locking/rwsem.c source code file:

    void up_write(struct rw_semaphore *sem)
    {
        rwsem_release(&sem->dep_map, 1, _RET_IP_);
        rwsem_clear_owner(sem);
        __up_write(sem);
    }

First of all it calls the rwsem_release macro, which is related to the lock validator of the Linux kernel, so we will skip it now. On the next line it calls the rwsem_clear_owner function which, as you may understand from its name, just clears the owner field of the given rw_semaphore:

    static inline void rwsem_clear_owner(struct rw_semaphore *sem)
    {
        sem->owner = NULL;
    }

The __up_write function does all the work of unlocking. It is an architecture-specific function, so in our case it is located in the arch/x86/include/asm/rwsem.h header file. If we take a look at the implementation of this function, we will see that it does almost the same as the __down_write function, but in reverse. Instead of adding RWSEM_ACTIVE_WRITE_BIAS to the count, we subtract the same value and check the sign of the result.

If the resulting value of rw_semaphore->count is not negative, the writer process has released the lock and now it may be acquired by someone else. Otherwise, rw_semaphore->count contains a negative value. This means that there is at least one waiter in the wait queue. In this case the call_rwsem_wake function will be called. This function acts like the similar functions which we already saw above: it stores the general purpose registers on the stack to preserve them and calls the rwsem_wake function.

First of all the rwsem_wake function checks whether a spinner is present. If so, the spinner will just acquire the lock which was just released by the lock owner. Otherwise there must be someone in the wait queue, and we need to wake either a writer process, if it is at the front of the wait queue, or all reader processes. The up_read function, which releases a reader lock, acts similarly to up_write, with a small difference. Instead of subtracting RWSEM_ACTIVE_WRITE_BIAS from rw_semaphore->count, it subtracts 1 from it, because the less significant word of the count contains the number of active locks. After this it checks the sign of the count and calls rwsem_wake, like __up_write, if the count is negative; otherwise the lock is successfully released.
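The two release fast paths described above can be sketched together in userspace. The toy_* and TOY_* names are hypothetical, the constant values are assumed, and the boolean return value merely stands for the decision to call the wake-up path; no tasks are actually queued or woken here:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's RWSEM_* constants. */
#define TOY_ACTIVE_MASK       INT64_C(0xffffffff)
#define TOY_ACTIVE_BIAS       INT64_C(1)
#define TOY_ACTIVE_READ_BIAS  TOY_ACTIVE_BIAS
#define TOY_WAITING_BIAS      (-TOY_ACTIVE_MASK - 1)
#define TOY_ACTIVE_WRITE_BIAS (TOY_WAITING_BIAS + TOY_ACTIVE_BIAS)

struct toy_rwsem {
    _Atomic int64_t count;
};

/* Releases a write lock; true means the wake-up path would be taken. */
static bool toy_up_write(struct toy_rwsem *sem)
{
    int64_t now = atomic_fetch_sub(&sem->count, TOY_ACTIVE_WRITE_BIAS)
                  - TOY_ACTIVE_WRITE_BIAS;
    return now < 0;   /* still negative: the waiting bias is present */
}

/* Releases a read lock; true means the wake-up path would be taken. */
static bool toy_up_read(struct toy_rwsem *sem)
{
    int64_t now = atomic_fetch_sub(&sem->count, TOY_ACTIVE_READ_BIAS)
                  - TOY_ACTIVE_READ_BIAS;
    return now < 0;
}
```

Releasing an uncontended write lock brings the count back to zero, so nothing has to be woken. If the waiting bias is still in the count after the subtraction, the result is negative and the sketch reports that the wake-up path is needed.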

That’s all. We have considered the API for manipulating reader/writer semaphores: up_read/up_write and down_read/down_write. We saw that the Linux kernel provides additional API besides these functions. But I will not consider the implementation of these functions in this part, because it should be similar to what we have seen here, except for a few subtleties.

Conclusion

This is the end of the fifth part of the synchronization primitives chapter in the Linux kernel. In this part we met a special type of semaphore: the readers/writer semaphore, which provides access to data for multiple processes to read or for one process to write. In the next part we will continue to dive into synchronization primitives in the Linux kernel.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email at anotherworldofworld@gmail.com, or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.