[Paper] BP-Wrapper: A System Framework Making Any Replacement Algorithms (Almost) Lock Contention Free
http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-09-1.pdf
Issues
In a multi-core environment, a new challenge for buffer management is to retain scalability under highly concurrent data-processing demands.
- The page replacement algorithm, a major component of buffer management, can seriously degrade the system's performance if it is not implemented in a scalable way.
- Most replacement algorithms protect their data structures with a single lock, which suffers high contention under concurrent accesses.
A common practice is to modify a replacement algorithm to reduce the contention, such as approximating LRU with the clock algorithm.
However, this hurts the hit ratio.
In this paper, the authors introduce batching and prefetching techniques that reduce lock contention while retaining a high hit ratio.
Introduction
- How can a replacement algorithm make replacement decisions in a way that is both effective and scalable?
- Disk I/O is expensive.
- Replacement algorithms for virtual memory and I/O buffers focus on improving hit ratios by organizing and managing deep page-access history.
- Representative algorithms include those that address the weaknesses of LRU, such as LIRS, 2Q, and ARC.
All of the above algorithms need to hold a lock (latch) while updating their cache data structures, as the sketch below illustrates.
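To make the bottleneck concrete, here is a minimal sketch (the names are illustrative, not the paper's code) of an LRU list guarded by one global pthread mutex. Note that even a page hit, a logically read-only event, must take the lock just to move the page to the MRU position.

```c
#include <pthread.h>

/* Illustrative sketch: a doubly linked LRU list guarded by one global
 * mutex -- the lock-per-access pattern the paper identifies as the
 * scalability bottleneck. */
typedef struct Page {
    struct Page *prev, *next;
    /* ... frame metadata ... */
} Page;

static Page lru_head;   /* circular sentinel; lru_head.next is the MRU end */
static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

void lru_init(void) { lru_head.prev = lru_head.next = &lru_head; }

/* Called on EVERY page access: the global lock serializes all threads
 * just to splice one node to the front of the list. */
void lru_touch(Page *p)
{
    pthread_mutex_lock(&lru_lock);
    p->prev->next = p->next;          /* unlink from current position */
    p->next->prev = p->prev;
    p->next = lru_head.next;          /* relink at the MRU position */
    p->prev = &lru_head;
    lru_head.next->prev = p;
    lru_head.next = p;
    pthread_mutex_unlock(&lru_lock);
}
```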
Performance degradation due to lock contention can be significant in large-scale systems.
This subject has been a major research issue for years.
The paper's experiments show that contention on the lock associated with the replacement algorithm can reduce database throughput by nearly a factor of two on a 16-processor system.
Two factors contribute to the performance degradation:
- One factor is the frequency of lock requests. The lock is globally shared by all transaction-processing threads to coordinate their page accesses, and it is requested every time a thread accesses a page.
- The other factor is the cost of acquiring the lock, including changing the lock state, busy-waiting, and/or context switches.
In general, lock contention can be reduced by using distributed locks to reduce lock granularity.
However, as the recorded history information is localized to each partition, the lack of global history information can be harmful to the performance of the replacement algorithms.
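A sketch of that alternative, with invented names and an illustrative partition count: pages hash to independently locked partitions, so threads rarely contend, but each partition sees only its own slice of the access history.

```c
#include <pthread.h>
#include <stdint.h>

#define NPART 16                      /* illustrative partition count */

typedef struct Page { struct Page *prev, *next; uint32_t page_no; } Page;

typedef struct {
    pthread_mutex_t lock;             /* one lock per partition */
    Page head;                        /* this partition's private LRU list */
} Partition;

static Partition parts[NPART];

void parts_init(void)
{
    for (int i = 0; i < NPART; i++) {
        pthread_mutex_init(&parts[i].lock, NULL);
        parts[i].head.prev = parts[i].head.next = &parts[i].head;
    }
}

/* Threads touching different partitions never contend, but each LRU
 * list now observes only ~1/NPART of the access stream, so eviction
 * decisions are made without global history information. */
void lru_touch_partitioned(Page *p)
{
    Partition *part = &parts[p->page_no % NPART];
    pthread_mutex_lock(&part->lock);
    p->prev->next = p->next; p->next->prev = p->prev;
    p->next = part->head.next; p->prev = &part->head;
    part->head.next->prev = p; part->head.next = p;
    pthread_mutex_unlock(&part->lock);
}
```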
Solution
- Batch execution, which amortizes the lock contention overhead over a batch of page accesses. Each thread records its page accesses in a private ring buffer; the replacement lock is taken only when the buffer fills, and all recorded accesses are committed at once. The paper chooses a private ring buffer per thread over a single ring buffer shared by all threads: the private buffer needs no synchronization of its own, which lowers complexity and improves performance, and it preserves the access sequence within each thread. A sketch follows.
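A minimal sketch of the batching idea, under stated assumptions: `BATCH_SIZE`, the thread-local buffer, and all names are illustrative, not the paper's actual code (the paper implements batching inside PostgreSQL's buffer manager).

```c
#include <pthread.h>

#define BATCH_SIZE 64                 /* illustrative; the paper tunes this */

typedef struct Page { struct Page *prev, *next; } Page;

static Page lru_head;                 /* circular sentinel, as before */
static pthread_mutex_t replacement_lock = PTHREAD_MUTEX_INITIALIZER;

/* One private ring buffer per thread: recording an access needs no
 * synchronization, and the thread's own access order is preserved. */
static __thread struct {
    Page *slots[BATCH_SIZE];
    int   count;
} tl_buf;

static void lru_touch_locked(Page *p) /* caller holds replacement_lock */
{
    p->prev->next = p->next; p->next->prev = p->prev;
    p->next = lru_head.next; p->prev = &lru_head;
    lru_head.next->prev = p; lru_head.next = p;
}

/* The global lock is taken once per BATCH_SIZE accesses instead of once
 * per access, amortizing its cost across the whole batch. */
void record_access(Page *p)
{
    tl_buf.slots[tl_buf.count++] = p;
    if (tl_buf.count < BATCH_SIZE)
        return;                       /* common case: no lock at all */

    pthread_mutex_lock(&replacement_lock);
    for (int i = 0; i < tl_buf.count; i++)
        lru_touch_locked(tl_buf.slots[i]);   /* replay the batch in order */
    pthread_mutex_unlock(&replacement_lock);
    tl_buf.count = 0;
}
```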
- Prefetching, which reduces the average lock-holding time by pre-loading the data needed by the replacement algorithm into the processor cache.
Prefetching is closely tied to the CPU cache. Before a core acquires the lock, the thread reads the data it is about to modify into that core's cache, e.g., for LRU, the forward/backward pointers of the list-head node.
After a write memory barrier (WMB), which the lock operation triggers, pending writes drain from the store buffer and invalidation requests are sent to the other cores' invalidate queues.
For the prefetched data (the list pointers, in the LRU case), the prefetch is a pure read into the CPU cache without modification.
Once the CPU holds the lock, it reads the prefetched data again: if the cache lines have since been invalidated, the data is re-fetched from the other cores; otherwise, the warm cached copy is used directly.
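A sketch of the prefetching idea, again with invented names: before taking the lock, the thread warms the cache lines the critical section will modify. `__builtin_prefetch` is a GCC/Clang intrinsic used here as one way to express the read; the paper describes plain reads of the bookkeeping data for the batched pages.

```c
#include <pthread.h>

typedef struct Page { struct Page *prev, *next; } Page;

static Page lru_head;
static pthread_mutex_t replacement_lock = PTHREAD_MUTEX_INITIALIZER;

void lru_touch_prefetched(Page *p)
{
    /* Warm the lines the critical section will write, BEFORE taking the
     * lock. This is a pure read: nothing is modified yet. */
    __builtin_prefetch(p, 1);         /* 1 = prepare for write */
    __builtin_prefetch(&lru_head, 1);

    pthread_mutex_lock(&replacement_lock);
    /* If another core wrote these lines after our prefetch, coherence
     * invalidated our copies and the accesses below re-fetch them; if
     * not, they hit in cache and the lock is held for less time. */
    p->prev->next = p->next; p->next->prev = p->prev;
    p->next = lru_head.next; p->prev = &lru_head;
    lru_head.next->prev = p; lru_head.next = p;
    pthread_mutex_unlock(&replacement_lock);
}
```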