[Paper] BP-Wrapper: A System Framework Making Any Replacement Algorithms (Almost) Lock Contention Free
http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-09-1.pdf
Issues
In a multi-core environment, a new challenge for buffer management is to retain scalability under highly concurrent data-processing demands.
- The page replacement algorithm, a major component of buffer management, can seriously degrade the system's performance if it is not implemented in a scalable way.
- Most replacement algorithms protect their data structures with a single lock, which suffers high contention under concurrent accesses.
A common practice is to modify a replacement algorithm to reduce the contention, such as approximating LRU with the clock algorithm.
However, this hurts the hit ratio.
In this paper, the authors introduce batching and prefetching techniques that reduce lock contention while retaining a high hit ratio.
Introduction
- How can a replacement algorithm make replacement decisions in a way that is both effective and scalable?
- Disk I/O is expensive.
- Replacement algorithms for virtual memory and I/O buffers focus on improving hit ratios by organizing and managing deep page-access history.
- Representative algorithms include those that address the weaknesses of LRU, such as LIRS, 2Q, and ARC.
All of the above algorithms need to hold a lock (latch) while updating their cache data structures, as the sketch below illustrates.
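To make the bottleneck concrete, here is a minimal sketch (the names are illustrative, not the paper's code) of an LRU list guarded by one global pthread mutex. Note that even a page hit, a logically read-only event, must take the lock just to move the page to the MRU position.

```c
#include <pthread.h>

/* Illustrative sketch: a doubly linked LRU list guarded by one global
 * mutex -- the lock-per-access pattern the paper identifies as the
 * scalability bottleneck. */
typedef struct Page {
    struct Page *prev, *next;
    /* ... frame metadata ... */
} Page;

static Page lru_head;   /* circular sentinel; lru_head.next is the MRU end */
static pthread_mutex_t lru_lock = PTHREAD_MUTEX_INITIALIZER;

void lru_init(void) { lru_head.prev = lru_head.next = &lru_head; }

/* Called on EVERY page access: the global lock serializes all threads
 * just to splice one node to the front of the list. */
void lru_touch(Page *p)
{
    pthread_mutex_lock(&lru_lock);
    p->prev->next = p->next;          /* unlink from current position */
    p->next->prev = p->prev;
    p->next = lru_head.next;          /* relink at the MRU position */
    p->prev = &lru_head;
    lru_head.next->prev = p;
    lru_head.next = p;
    pthread_mutex_unlock(&lru_lock);
}
```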
Performance degradation due to lock contention can be significant in large-scale systems.
This subject has been a major research issue for years.
The paper's experiments show that contention on the lock associated with the replacement algorithm can reduce database throughput by nearly a factor of two on a 16-processor system.
Two factors contribute to the performance degradation:
- One factor is the frequency of lock requests. The lock is globally shared by all transaction-processing threads to coordinate their page accesses, and it is requested every time a thread accesses a page.
- The other factor is the cost of acquiring the lock, including changing the lock state, busy-waiting, and/or context switches.
In general, lock contention can be reduced by using distributed locks to reduce lock granularity.
However, as the recorded history information is localized to each partition, the lack of global history information can be harmful to the performance of the replacement algorithms.
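A sketch of that alternative, with invented names and an illustrative partition count: pages hash to independently locked partitions, so threads rarely contend, but each partition sees only its own slice of the access history.

```c
#include <pthread.h>
#include <stdint.h>

#define NPART 16                      /* illustrative partition count */

typedef struct Page { struct Page *prev, *next; uint32_t page_no; } Page;

typedef struct {
    pthread_mutex_t lock;             /* one lock per partition */
    Page head;                        /* this partition's private LRU list */
} Partition;

static Partition parts[NPART];

void parts_init(void)
{
    for (int i = 0; i < NPART; i++) {
        pthread_mutex_init(&parts[i].lock, NULL);
        parts[i].head.prev = parts[i].head.next = &parts[i].head;
    }
}

/* Threads touching different partitions never contend, but each LRU
 * list now observes only ~1/NPART of the access stream, so eviction
 * decisions are made without global history information. */
void lru_touch_partitioned(Page *p)
{
    Partition *part = &parts[p->page_no % NPART];
    pthread_mutex_lock(&part->lock);
    p->prev->next = p->next; p->next->prev = p->prev;
    p->next = part->head.next; p->prev = &part->head;
    part->head.next->prev = p; part->head.next = p;
    pthread_mutex_unlock(&part->lock);
}
```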
Solution
- Batch execution, which amortizes the lock contention overhead over a batch of page accesses. Each thread records its page accesses in a private ring buffer; the replacement lock is taken only when the buffer fills, and all recorded accesses are committed at once. The paper chooses a private ring buffer per thread over a single ring buffer shared by all threads: the private buffer needs no synchronization of its own, which lowers complexity and improves performance, and it preserves the access sequence within each thread. A sketch follows.
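A minimal sketch of the batching idea, under stated assumptions: `BATCH_SIZE`, the thread-local buffer, and all names are illustrative, not the paper's actual code (the paper implements batching inside PostgreSQL's buffer manager).

```c
#include <pthread.h>

#define BATCH_SIZE 64                 /* illustrative; the paper tunes this */

typedef struct Page { struct Page *prev, *next; } Page;

static Page lru_head;                 /* circular sentinel, as before */
static pthread_mutex_t replacement_lock = PTHREAD_MUTEX_INITIALIZER;

/* One private ring buffer per thread: recording an access needs no
 * synchronization, and the thread's own access order is preserved. */
static __thread struct {
    Page *slots[BATCH_SIZE];
    int   count;
} tl_buf;

static void lru_touch_locked(Page *p) /* caller holds replacement_lock */
{
    p->prev->next = p->next; p->next->prev = p->prev;
    p->next = lru_head.next; p->prev = &lru_head;
    lru_head.next->prev = p; lru_head.next = p;
}

/* The global lock is taken once per BATCH_SIZE accesses instead of once
 * per access, amortizing its cost across the whole batch. */
void record_access(Page *p)
{
    tl_buf.slots[tl_buf.count++] = p;
    if (tl_buf.count < BATCH_SIZE)
        return;                       /* common case: no lock at all */

    pthread_mutex_lock(&replacement_lock);
    for (int i = 0; i < tl_buf.count; i++)
        lru_touch_locked(tl_buf.slots[i]);   /* replay the batch in order */
    pthread_mutex_unlock(&replacement_lock);
    tl_buf.count = 0;
}
```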
- Prefetching, which reduces the average lock-holding time by pre-loading the data needed by the replacement algorithm into the processor cache.
Prefetching is closely tied to the CPU cache. Before a core acquires the lock, the thread reads the data it is about to modify into that core's cache, e.g., for LRU, the forward/backward pointers of the list-head node.
After a write memory barrier (WMB), which the lock operation triggers, pending writes drain from the store buffer and invalidation requests are sent to the other cores' invalidate queues.
For the prefetched data (the list pointers, in the LRU case), the prefetch is a pure read into the CPU cache without modification.
Once the CPU holds the lock, it reads the prefetched data again: if the cache lines have since been invalidated, the data is re-fetched from the other cores; otherwise, the warm cached copy is used directly.
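A sketch of the prefetching idea, again with invented names: before taking the lock, the thread warms the cache lines the critical section will modify. `__builtin_prefetch` is a GCC/Clang intrinsic used here as one way to express the read; the paper describes plain reads of the bookkeeping data for the batched pages.

```c
#include <pthread.h>

typedef struct Page { struct Page *prev, *next; } Page;

static Page lru_head;
static pthread_mutex_t replacement_lock = PTHREAD_MUTEX_INITIALIZER;

void lru_touch_prefetched(Page *p)
{
    /* Warm the lines the critical section will write, BEFORE taking the
     * lock. This is a pure read: nothing is modified yet. */
    __builtin_prefetch(p, 1);         /* 1 = prepare for write */
    __builtin_prefetch(&lru_head, 1);

    pthread_mutex_lock(&replacement_lock);
    /* If another core wrote these lines after our prefetch, coherence
     * invalidated our copies and the accesses below re-fetch them; if
     * not, they hit in cache and the lock is held for less time. */
    p->prev->next = p->next; p->next->prev = p->prev;
    p->next = lru_head.next; p->prev = &lru_head;
    lru_head.next->prev = p; lru_head.next = p;
    pthread_mutex_unlock(&replacement_lock);
}
```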