
The general-purpose graphics processing unit (GPGPU) has gradually become one of the most mainstream acceleration components in throughput-oriented high-performance computing. Its high performance lies in its massively multi-threaded architecture: compared with the CPU, the GPGPU has many more processing units to support its single-instruction multiple-thread execution model, and by quickly switching context between threads it can hide the long latency caused by operations such as memory accesses. To make a large number of threads run efficiently, consecutive threads are grouped into units called warps (or wavefronts); the warp is the basic unit of task scheduling and execution, and all threads in a warp execute the same instruction on different data.

Nevertheless, the long latency of memory operations remains the bottleneck of GPU performance. The L1 data cache has little capacity, and multiple warps share this one small cache, which makes the cache suffer heavy contention and causes pipeline stalls.

We propose Locality-Based Cache Management (LCM), combined with Locality-Based Warp Scheduling (LWS), to reduce cache contention and improve GPU performance. Each load instruction can be classified into one of three locality types: streaming locality, where data are used only once; intra-warp locality, where data are accessed multiple times within the same warp; and inter-warp locality, where data are accessed by different warps. According to the locality of the load instruction, LCM applies cache bypassing to streaming requests to improve cache utilization and extends inter-warp memory request coalescing to make full use of inter-warp locality; combined with LWS, this alleviates cache contention.

Experimental evaluation shows that LCM and LWS effectively improve cache performance and thereby overall GPU performance, achieving an average improvement of 26% over the baseline GPU.
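To make the three locality classes concrete, the following is a minimal hypothetical sketch, not the paper's implementation: it classifies one static load instruction from a trace of (warp_id, address) pairs, where the trace format, cache line size, and function name are all assumptions for illustration.

```python
from collections import defaultdict

def classify_load(accesses, line_size=128):
    """Classify a load instruction's locality from its access trace.

    accesses: list of (warp_id, address) pairs produced by one static
    load instruction. Returns 'streaming', 'intra-warp', or 'inter-warp'.
    (Illustrative only; the trace format and line size are assumptions.)
    """
    warps_per_line = defaultdict(set)   # cache line -> warps that touched it
    hits_per_line = defaultdict(int)    # cache line -> number of accesses
    for warp_id, addr in accesses:
        line = addr // line_size
        warps_per_line[line].add(warp_id)
        hits_per_line[line] += 1

    if any(len(warps) > 1 for warps in warps_per_line.values()):
        return 'inter-warp'   # the same line is reused by different warps
    if any(count > 1 for count in hits_per_line.values()):
        return 'intra-warp'   # reused, but only within a single warp
    return 'streaming'        # every line is touched exactly once

# Each line is touched once by one warp -> streaming, a bypass candidate.
trace = [(0, 0), (0, 128), (1, 256)]
print(classify_load(trace))  # streaming
```

Under a policy like LCM's, a 'streaming' result would mark the load for cache bypassing, while the two reuse classes keep their data in the L1 cache.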

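Inter-warp memory request coalescing can be sketched as an MSHR-style table that merges outstanding requests to the same cache line even when they come from different warps. This is a simplified model under assumed parameters (128-byte lines, warp-granularity requests), not the hardware design:

```python
class CoalescingQueue:
    """Merge pending memory requests that target the same cache line.

    Requests from different warps to one in-flight line are coalesced
    into a single memory transaction; all merged warps are served when
    the data returns. (Simplified illustration; a real MSHR also tracks
    offsets, request sizes, and wakeup bookkeeping.)
    """
    def __init__(self, line_size=128):
        self.line_size = line_size
        self.pending = {}  # cache line -> set of warp ids waiting on it

    def request(self, warp_id, addr):
        line = addr // self.line_size
        if line in self.pending:          # coalesce with an in-flight request
            self.pending[line].add(warp_id)
            return False                  # no new memory transaction issued
        self.pending[line] = {warp_id}
        return True                       # a new transaction goes to memory

q = CoalescingQueue()
issued = [q.request(w, a) for w, a in [(0, 0), (1, 64), (2, 0), (3, 128)]]
print(issued)  # [True, False, False, True]
```

In the example, warps 1 and 2 hit the cache line already requested by warp 0, so four requests cost only two memory transactions; this is the kind of inter-warp reuse the coalescing extension is meant to capture.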