Python3實(shí)現(xiàn)打格點(diǎn)算法的GPU加速實(shí)例詳解
技術(shù)背景
在數(shù)學(xué)和物理學(xué)領(lǐng)域,總是充滿了各種連續(xù)的函數(shù)模型。而當(dāng)我們用現(xiàn)代計(jì)算機(jī)的技術(shù)去處理這些問(wèn)題的時(shí)候,事實(shí)上是無(wú)法直接處理連續(xù)模型的,絕大多數(shù)的情況下都要轉(zhuǎn)化成一個(gè)離散的模型再進(jìn)行數(shù)值的計(jì)算。比如計(jì)算數(shù)值的積分,計(jì)算數(shù)值的二階導(dǎo)數(shù)(海森矩陣)等等。這里我們所介紹的打格點(diǎn)的算法,正是一種典型的離散化方法。這個(gè)對(duì)空間做離散化的方法,可以在很大程度上簡(jiǎn)化運(yùn)算量。比如在分子動(dòng)力學(xué)模擬中,計(jì)算近鄰表的時(shí)候,如果不采用打格點(diǎn)的方法,那么就要針對(duì)整個(gè)空間所有的原子進(jìn)行搜索,計(jì)算出來(lái)距離再判斷是否近鄰。而如果采用打格點(diǎn)的方法,我們只需要先遍歷一遍原子對(duì)齊進(jìn)行打格點(diǎn)的離散化,之后再計(jì)算近鄰表的時(shí)候,只需要計(jì)算三維空間下鄰近的27個(gè)格子中的原子是否滿足近鄰條件即可。在這篇文章中,我們主要探討如何用GPU來(lái)實(shí)現(xiàn)打格點(diǎn)的算法。
打格點(diǎn)算法實(shí)現(xiàn)
我們先來(lái)用一個(gè)例子說(shuō)明一下什么叫打格點(diǎn)。對(duì)于一個(gè)給定所有原子坐標(biāo)的系統(tǒng),也就是已知了[x,y,z],我們需要得到的是這些原子所在的對(duì)應(yīng)的格子位置[nx,ny,nz]。我們先看一下在CPU上的實(shí)現(xiàn)方案,是一個(gè)遍歷一次的算法:
# cuda_grid.py from numba import jit from numba import cuda import numpy as np def grid_by_cpu(crd, rxyz, atoms, grids): """Transform coordinates [x,y,z] into grids [nx,ny,nz]. Args: crd(list): The 3-D coordinates of atoms. rxyz(list): The list includes xmin,ymin,zmin,grid_num. atoms(int): The total number of atoms. grids(list): The transformed grids matrix. """ for i in range(atoms): grids[i][0] = int((crd[i][0]-rxyz[0])/rxyz[3]) grids[i][1] = int((crd[i][1]-rxyz[1])/rxyz[3]) grids[i][2] = int((crd[i][2]-rxyz[2])/rxyz[3]) return grids if __name__=='__main__': np.random.seed(1) atoms = 4 grid_size = 0.1 crd = np.random.random((atoms,3)).astype(np.float32) xmin = min(crd[:,0]) ymin = min(crd[:,1]) zmin = min(crd[:,2]) xmax = max(crd[:,0]) ymax = max(crd[:,1]) zmax = max(crd[:,2]) xgrids = int((xmax-xmin)/grid_size)+1 ygrids = int((ymax-ymin)/grid_size)+1 zgrids = int((zmax-zmin)/grid_size)+1 rxyz = np.array([xmin,ymin,zmin,grid_size], dtype=np.float32) grids = np.ones_like(crd)*(-1) grids = grids.astype(np.float32) grids_cpu = grid_by_cpu(crd, rxyz, atoms, grids) print (crd) print (grids_cpu) import matplotlib.pyplot as plt plt.figure() plt.plot(crd[:,0], crd[:,1], 'o', color='red') for grid in range(ygrids+1): plt.plot([xmin,xmin+grid_size*xgrids], [ymin+grid_size*grid,ymin+grid_size*grid], color='black') for grid in range(xgrids+1): plt.plot([xmin+grid_size*grid,xmin+grid_size*grid], [ymin,ymin+grid_size*ygrids], color='black') plt.savefig('Atom_Grids.png')
輸出結(jié)果如下,
$ python3 cuda_grid.py
[[4.17021990e-01 7.20324516e-01 1.14374816e-04]
[3.02332580e-01 1.46755889e-01 9.23385918e-02]
[1.86260208e-01 3.45560730e-01 3.96767467e-01]
[5.38816750e-01 4.19194520e-01 6.85219526e-01]]
[[2. 5. 0.]
[1. 0. 0.]
[0. 1. 3.]
[3. 2. 6.]]
上面兩個(gè)打印輸出就分別對(duì)應(yīng)于[x,y,z]和[nx,ny,nz],比如第一個(gè)原子被放到了編號(hào)為[2,5,0]的格點(diǎn)。那么為了方便理解打格點(diǎn)的方法,我們把這個(gè)三維空間的原子系統(tǒng)和打格點(diǎn)以后的標(biāo)號(hào)取前兩個(gè)維度來(lái)可視化一下結(jié)果,作圖以后效果如下:
我們可以看到,這些紅色的點(diǎn)就是原子所處的位置,而黑色的網(wǎng)格線就是我們所標(biāo)記的格點(diǎn)。在原子數(shù)量比較多的時(shí)候,有可能出現(xiàn)在一個(gè)網(wǎng)格中存在很多個(gè)原子的情況,所以如何打格點(diǎn),格點(diǎn)大小如何去定義,這都是不同場(chǎng)景下的經(jīng)驗(yàn)參數(shù),需要大家一起去摸索。
打格點(diǎn)算法加速
在上面這個(gè)算法實(shí)現(xiàn)中,我們主要是用到了一個(gè)for循環(huán),這時(shí)候我們可以想到numba所支持的向量化運(yùn)算,還有GPU硬件加速,這里我們先對(duì)比一下三種實(shí)現(xiàn)方案的計(jì)算結(jié)果:
# cuda_grid.py from numba import jit from numba import cuda import numpy as np def grid_by_cpu(crd, rxyz, atoms, grids): """Transform coordinates [x,y,z] into grids [nx,ny,nz]. Args: crd(list): The 3-D coordinates of atoms. rxyz(list): The list includes xmin,ymin,zmin,grid_num. atoms(int): The total number of atoms. grids(list): The transformed grids matrix. """ for i in range(atoms): grids[i][0] = int((crd[i][0]-rxyz[0])/rxyz[3]) grids[i][1] = int((crd[i][1]-rxyz[1])/rxyz[3]) grids[i][2] = int((crd[i][2]-rxyz[2])/rxyz[3]) return grids @jit def grid_by_jit(crd, rxyz, atoms, grids): """Transform coordinates [x,y,z] into grids [nx,ny,nz]. Args: crd(list): The 3-D coordinates of atoms. rxyz(list): The list includes xmin,ymin,zmin,grid_num. atoms(int): The total number of atoms. grids(list): The transformed grids matrix. """ for i in range(atoms): grids[i][0] = int((crd[i][0]-rxyz[0])/rxyz[3]) grids[i][1] = int((crd[i][1]-rxyz[1])/rxyz[3]) grids[i][2] = int((crd[i][2]-rxyz[2])/rxyz[3]) return grids @cuda.jit def grid_by_gpu(crd, rxyz, grids): """Transform coordinates [x,y,z] into grids [nx,ny,nz]. Args: crd(list): The 3-D coordinates of atoms. rxyz(list): The list includes xmin,ymin,zmin,grid_num. atoms(int): The total number of atoms. grids(list): The transformed grids matrix. """ i,j = cuda.grid(2) grids[i][j] = int((crd[i][j]-rxyz[j])/rxyz[3]) if __name__=='__main__': np.random.seed(1) atoms = 4 grid_size = 0.1 crd = np.random.random((atoms,3)).astype(np.float32) xmin = min(crd[:,0]) ymin = min(crd[:,1]) zmin = min(crd[:,2]) xmax = max(crd[:,0]) ymax = max(crd[:,1]) zmax = max(crd[:,2]) xgrids = int((xmax-xmin)/grid_size)+1 ygrids = int((ymax-ymin)/grid_size)+1 zgrids = int((zmax-zmin)/grid_size)+1 rxyz = np.array([xmin,ymin,zmin,grid_size], dtype=np.float32) crd_cuda = cuda.to_device(crd) rxyz_cuda = cuda.to_device(rxyz) grids = np.ones_like(crd)*(-1) grids = grids.astype(np.float32) grids_cpu = grid_by_cpu(crd, rxyz, atoms, grids) grids = np.ones_like(crd)*(-1) grids_jit = grid_by_jit(crd, rxyz, atoms, grids) grids = np.ones_like(crd)*(-1) grids_cuda = cuda.to_device(grids) grid_by_gpu[(atoms,3),(1,1)](crd_cuda, rxyz_cuda, grids_cuda) print (crd) print (grids_cpu) print (grids_jit) print (grids_cuda.copy_to_host())
輸出結(jié)果如下:
$ python3 cuda_grid.py
/home/dechin/anaconda3/lib/python3.8/site-packages/numba/cuda/compiler.py:865: NumbaPerformanceWarning: Grid size (12) < 2 * SM count (72) will likely result in GPU under utilization due to low occupancy.
warn(NumbaPerformanceWarning(msg))
[[4.17021990e-01 7.20324516e-01 1.14374816e-04]
[3.02332580e-01 1.46755889e-01 9.23385918e-02]
[1.86260208e-01 3.45560730e-01 3.96767467e-01]
[5.38816750e-01 4.19194520e-01 6.85219526e-01]]
[[2. 5. 0.]
[1. 0. 0.]
[0. 1. 3.]
[3. 2. 6.]]
[[2. 5. 0.]
[1. 0. 0.]
[0. 1. 3.]
[3. 2. 6.]]
[[2. 5. 0.]
[1. 0. 0.]
[0. 1. 3.]
[3. 2. 6.]]
我們先看到這里面的告警信息,因?yàn)镚PU硬件加速要在一定密度的運(yùn)算量之上才能夠有比較明顯的加速效果。比如說(shuō)我們只是計(jì)算兩個(gè)數(shù)字的加和,那么是完全沒(méi)有必要使用到GPU的。但是如果我們要計(jì)算兩個(gè)非常大的數(shù)組的加和,那么這個(gè)時(shí)候GPU就能夠發(fā)揮出非常大的價(jià)值。因?yàn)檫@里我們的案例中只有4個(gè)原子,因此提示我們這時(shí)候是體現(xiàn)不出來(lái)GPU的加速效果的。我們僅僅關(guān)注下這里的運(yùn)算結(jié)果,在不同體系下得到的格點(diǎn)結(jié)果是一致的,那么接下來(lái)就可以對(duì)比一下幾種不同實(shí)現(xiàn)方式的速度差異。
# cuda_grid.py from numba import jit from numba import cuda import numpy as np def grid_by_cpu(crd, rxyz, atoms, grids): """Transform coordinates [x,y,z] into grids [nx,ny,nz]. Args: crd(list): The 3-D coordinates of atoms. rxyz(list): The list includes xmin,ymin,zmin,grid_num. atoms(int): The total number of atoms. grids(list): The transformed grids matrix. """ for i in range(atoms): grids[i][0] = int((crd[i][0]-rxyz[0])/rxyz[3]) grids[i][1] = int((crd[i][1]-rxyz[1])/rxyz[3]) grids[i][2] = int((crd[i][2]-rxyz[2])/rxyz[3]) return grids @jit def grid_by_jit(crd, rxyz, atoms, grids): """Transform coordinates [x,y,z] into grids [nx,ny,nz]. Args: crd(list): The 3-D coordinates of atoms. rxyz(list): The list includes xmin,ymin,zmin,grid_num. atoms(int): The total number of atoms. grids(list): The transformed grids matrix. """ for i in range(atoms): grids[i][0] = int((crd[i][0]-rxyz[0])/rxyz[3]) grids[i][1] = int((crd[i][1]-rxyz[1])/rxyz[3]) grids[i][2] = int((crd[i][2]-rxyz[2])/rxyz[3]) return grids @cuda.jit def grid_by_gpu(crd, rxyz, grids): """Transform coordinates [x,y,z] into grids [nx,ny,nz]. Args: crd(list): The 3-D coordinates of atoms. rxyz(list): The list includes xmin,ymin,zmin,grid_num. atoms(int): The total number of atoms. grids(list): The transformed grids matrix. """ i,j = cuda.grid(2) grids[i][j] = int((crd[i][j]-rxyz[j])/rxyz[3]) if __name__=='__main__': import time from tqdm import trange np.random.seed(1) atoms = 100000 grid_size = 0.1 crd = np.random.random((atoms,3)).astype(np.float32) xmin = min(crd[:,0]) ymin = min(crd[:,1]) zmin = min(crd[:,2]) xmax = max(crd[:,0]) ymax = max(crd[:,1]) zmax = max(crd[:,2]) xgrids = int((xmax-xmin)/grid_size)+1 ygrids = int((ymax-ymin)/grid_size)+1 zgrids = int((zmax-zmin)/grid_size)+1 rxyz = np.array([xmin,ymin,zmin,grid_size], dtype=np.float32) crd_cuda = cuda.to_device(crd) rxyz_cuda = cuda.to_device(rxyz) cpu_time = 0 jit_time = 0 gpu_time = 0 for i in trange(100): grids = np.ones_like(crd)*(-1) grids = grids.astype(np.float32) time0 = time.time() grids_cpu = grid_by_cpu(crd, rxyz, atoms, grids) time1 = time.time() grids = np.ones_like(crd)*(-1) time2 = time.time() grids_jit = grid_by_jit(crd, rxyz, atoms, grids) time3 = time.time() grids = np.ones_like(crd)*(-1) grids_cuda = cuda.to_device(grids) time4 = time.time() grid_by_gpu[(atoms,3),(1,1)](crd_cuda, rxyz_cuda, grids_cuda) time5 = time.time() if i != 0: cpu_time += time1 - time0 jit_time += time3 - time2 gpu_time += time5 - time4 print ('The time cost of CPU calculation is: {}s'.format(cpu_time)) print ('The time cost of JIT calculation is: {}s'.format(jit_time)) print ('The time cost of GPU calculation is: {}s'.format(gpu_time))
輸出結(jié)果如下:
$ python3 cuda_grid.py
100%|███████████████████████████| 100/100 [00:23<00:00, 4.18it/s]
The time cost of CPU calculation is: 23.01943016052246s
The time cost of JIT calculation is: 0.04810166358947754s
The time cost of GPU calculation is: 0.01806473731994629s
在100000個(gè)原子的體系規(guī)模下,普通的for循環(huán)實(shí)現(xiàn)效率就非常的低下,需要23s,而經(jīng)過(guò)向量化運(yùn)算的加速之后,直接飛升到了0.048s,而GPU上的加速更是達(dá)到了0.018s,相比于沒(méi)有GPU硬件加速的場(chǎng)景,實(shí)現(xiàn)了將近2倍的加速。但是這還遠(yuǎn)遠(yuǎn)不是GPU加速的上限,讓我們?cè)贉y(cè)試一個(gè)更大的案例:
# cuda_grid.py from numba import jit from numba import cuda import numpy as np def grid_by_cpu(crd, rxyz, atoms, grids): """Transform coordinates [x,y,z] into grids [nx,ny,nz]. Args: crd(list): The 3-D coordinates of atoms. rxyz(list): The list includes xmin,ymin,zmin,grid_num. atoms(int): The total number of atoms. grids(list): The transformed grids matrix. """ for i in range(atoms): grids[i][0] = int((crd[i][0]-rxyz[0])/rxyz[3]) grids[i][1] = int((crd[i][1]-rxyz[1])/rxyz[3]) grids[i][2] = int((crd[i][2]-rxyz[2])/rxyz[3]) return grids @jit def grid_by_jit(crd, rxyz, atoms, grids): """Transform coordinates [x,y,z] into grids [nx,ny,nz]. Args: crd(list): The 3-D coordinates of atoms. rxyz(list): The list includes xmin,ymin,zmin,grid_num. atoms(int): The total number of atoms. grids(list): The transformed grids matrix. """ for i in range(atoms): grids[i][0] = int((crd[i][0]-rxyz[0])/rxyz[3]) grids[i][1] = int((crd[i][1]-rxyz[1])/rxyz[3]) grids[i][2] = int((crd[i][2]-rxyz[2])/rxyz[3]) return grids @cuda.jit def grid_by_gpu(crd, rxyz, grids): """Transform coordinates [x,y,z] into grids [nx,ny,nz]. Args: crd(list): The 3-D coordinates of atoms. rxyz(list): The list includes xmin,ymin,zmin,grid_num. atoms(int): The total number of atoms. grids(list): The transformed grids matrix. """ i,j = cuda.grid(2) grids[i][j] = int((crd[i][j]-rxyz[j])/rxyz[3]) if __name__=='__main__': import time from tqdm import trange np.random.seed(1) atoms = 5000000 grid_size = 0.1 crd = np.random.random((atoms,3)).astype(np.float32) xmin = min(crd[:,0]) ymin = min(crd[:,1]) zmin = min(crd[:,2]) xmax = max(crd[:,0]) ymax = max(crd[:,1]) zmax = max(crd[:,2]) xgrids = int((xmax-xmin)/grid_size)+1 ygrids = int((ymax-ymin)/grid_size)+1 zgrids = int((zmax-zmin)/grid_size)+1 rxyz = np.array([xmin,ymin,zmin,grid_size], dtype=np.float32) crd_cuda = cuda.to_device(crd) rxyz_cuda = cuda.to_device(rxyz) jit_time = 0 gpu_time = 0 for i in trange(100): grids = np.ones_like(crd)*(-1) time2 = time.time() grids_jit = grid_by_jit(crd, rxyz, atoms, grids) time3 = time.time() grids = np.ones_like(crd)*(-1) grids_cuda = cuda.to_device(grids) time4 = time.time() grid_by_gpu[(atoms,3),(1,1)](crd_cuda, rxyz_cuda, grids_cuda) time5 = time.time() if i != 0: jit_time += time3 - time2 gpu_time += time5 - time4 print ('The time cost of JIT calculation is: {}s'.format(jit_time)) print ('The time cost of GPU calculation is: {}s'.format(gpu_time))
在這個(gè)5000000個(gè)原子的案例中,因?yàn)槠胀ǖ膄or循環(huán)已經(jīng)實(shí)在是跑不動(dòng)了,因此我們就干脆不統(tǒng)計(jì)這一部分的時(shí)間,最后輸出結(jié)果如下:
$ python3 cuda_grid.py
100%|███████████████████████████| 100/100 [00:09<00:00, 10.15it/s]
The time cost of JIT calculation is: 2.3743042945861816s
The time cost of GPU calculation is: 0.022843599319458008s
在如此大規(guī)模的運(yùn)算下,GPU實(shí)現(xiàn)100倍的加速,而此時(shí)作為對(duì)比的CPU上的實(shí)現(xiàn)方法是已經(jīng)用上了向量化運(yùn)算的操作,也已經(jīng)可以認(rèn)為是一個(gè)極致的加速了。
總結(jié)概要
在這篇文章中,我們主要介紹了打格點(diǎn)算法在分子動(dòng)力學(xué)模擬中的重要價(jià)值,以及幾種不同的實(shí)現(xiàn)方式。其中最普通的for循環(huán)的實(shí)現(xiàn)效率比較低下,從算法復(fù)雜度上來(lái)講卻已經(jīng)是極致。而基于CPU上的向量化運(yùn)算的技術(shù),可以對(duì)計(jì)算過(guò)程進(jìn)行非常深度的優(yōu)化。當(dāng)然,這個(gè)案例在不同的硬件上也能夠發(fā)揮出明顯不同的加速效果,在GPU的加持之下,可以獲得100倍以上的加速效果。這也是一個(gè)在Python上實(shí)現(xiàn)GPU加速算法的一個(gè)典型案例。
到此這篇關(guān)于Python3實(shí)現(xiàn)打格點(diǎn)算法的GPU加速的文章就介紹到這了,更多相關(guān)Python3實(shí)現(xiàn)打格點(diǎn)算法內(nèi)容請(qǐng)搜索腳本之家以前的文章或繼續(xù)瀏覽下面的相關(guān)文章希望大家以后多多支持腳本之家!
- Python基于pyCUDA實(shí)現(xiàn)GPU加速并行計(jì)算功能入門(mén)教程
- 關(guān)于Python的GPU編程實(shí)例近鄰表計(jì)算的講解
- Python實(shí)現(xiàn)GPU加速的基本操作
- GPU排隊(duì)腳本實(shí)現(xiàn)空閑觸發(fā)python腳本實(shí)現(xiàn)示例
- python 詳解如何使用GPU大幅提高效率
- python沒(méi)有g(shù)pu,如何改用cpu跑代碼
- 淺談Python實(shí)時(shí)檢測(cè)CPU和GPU的功耗
- 一文詳解如何用GPU來(lái)運(yùn)行Python代碼
- Python Pytorch gpu 分析環(huán)境配置
- 利用Python進(jìn)行全面的GPU環(huán)境檢測(cè)與分析
- Python調(diào)用GPU算力的實(shí)現(xiàn)步驟
相關(guān)文章
python使用sklearn實(shí)現(xiàn)決策樹(shù)的方法示例
這篇文章主要介紹了python使用sklearn實(shí)現(xiàn)決策樹(shù)的方法示例,文中通過(guò)示例代碼介紹的非常詳細(xì),對(duì)大家的學(xué)習(xí)或者工作具有一定的參考學(xué)習(xí)價(jià)值,需要的朋友們下面隨著小編來(lái)一起學(xué)習(xí)學(xué)習(xí)吧2019-09-09簡(jiǎn)單了解python調(diào)用其他腳本方法實(shí)例
這篇文章主要介紹了簡(jiǎn)單了解python調(diào)用其他腳本方法實(shí)例,文中通過(guò)示例代碼介紹的非常詳細(xì),對(duì)大家的學(xué)習(xí)或者工作具有一定的參考學(xué)習(xí)價(jià)值,需要的朋友可以參考下2020-03-03Python(PyS60)實(shí)現(xiàn)簡(jiǎn)單語(yǔ)音整點(diǎn)報(bào)時(shí)
這篇文章主要為大家詳細(xì)介紹了Python(PyS60)實(shí)現(xiàn)簡(jiǎn)單語(yǔ)音整點(diǎn)報(bào)時(shí),文中示例代碼介紹的非常詳細(xì),具有一定的參考價(jià)值,感興趣的小伙伴們可以參考一下2019-11-11Python for循環(huán)通過(guò)序列索引迭代過(guò)程解析
這篇文章主要介紹了Python for循環(huán)通過(guò)序列索引迭代過(guò)程解析,文中通過(guò)示例代碼介紹的非常詳細(xì),對(duì)大家的學(xué)習(xí)或者工作具有一定的參考學(xué)習(xí)價(jià)值,需要的朋友可以參考下2020-02-02Python創(chuàng)建普通菜單示例【基于win32ui模塊】
這篇文章主要介紹了Python創(chuàng)建普通菜單,結(jié)合實(shí)例形式分析了Python基于win32ui模塊創(chuàng)建普通菜單及添加菜單項(xiàng)的相關(guān)操作技巧,并附帶說(shuō)明了win32ui模塊的安裝命令,需要的朋友可以參考下2018-05-05使用OpenCV實(shí)現(xiàn)讀取和顯示圖像與視頻
OpenCV 是一個(gè)強(qiáng)大的計(jì)算機(jī)視覺(jué)庫(kù),廣泛應(yīng)用于圖像處理和視頻處理等領(lǐng)域,本文將詳細(xì)介紹如何使用 OpenCV 在 Python 中讀取和顯示圖像以及視頻,希望對(duì)大家有所幫助2024-11-11利用Tensorboard繪制網(wǎng)絡(luò)識(shí)別準(zhǔn)確率和loss曲線實(shí)例
今天小編就為大家分享一篇利用Tensorboard繪制網(wǎng)絡(luò)識(shí)別準(zhǔn)確率和loss曲線實(shí)例,具有很好的參考價(jià)值,希望對(duì)大家有所幫助。一起跟隨小編過(guò)來(lái)看看吧2020-02-02python中opencv圖像疊加、圖像融合、按位操作的具體實(shí)現(xiàn)
opencv圖像操作可以更好更快的方便我們處理圖片,本文主要介紹了圖像疊加、圖像融合、按位操作,具有一定的參考價(jià)值,感興趣的小伙伴們可以參考一下2021-07-07