<acronym id="s8ci2"><small id="s8ci2"></small></acronym>

<rt id="s8ci2"></rt><rt id="s8ci2"><optgroup id="s8ci2"></optgroup></rt>

<acronym id="s8ci2"></acronym>

<acronym id="s8ci2"><center id="s8ci2"></center></acronym>

搜索歷史

清空

搜索熱詞

0

聊天消息
系統消息
評論與回復

查看更多

查看更多

查看更多

VIP于到期續費

登錄后你可以

下載海量資料
學習在線課程
觀看技術視頻
寫文章/發帖/加入社區

會員中心

創作中心

發布

創作活動

完善資料讓更多小伙伴認識你，還能領取20積分哦，立即完善>

3天內不再提示

在AMD GPU上如何安裝和配置triton？

OpenAI/Triton MLIR 第四章: ROCm-triton配置

最近在整理python-based的benchmark代碼，反過來在NV的GPU上又把Triton裝了一遍，發現Triton的github repo已經給出了對應的llvm的commit id以及對應的編譯細節，然后跟著走了一遍，也順利的安裝成功，只需要按照如下方式即可完成NV GPU上的安裝，

1.gitclonehttps://github.com/openai/triton.git;
2.cdtriton;
3.cd$HOME/llvm-project#yourcloneofLLVM.
4.gitcheckout49af6502
5.mkdirbuild
6.cdbuild
7.cmake-GNinja-DCMAKE_BUILD_TYPE=Release-DLLVM_ENABLE_ASSERTIONS=ON../llvm-DLLVM_ENABLE_PROJECTS="mlir;llvm"
8.ninja-j8

exportLLVM_BUILD_DIR=$HOME/llvm-project/build

cd
LLVM_INCLUDE_DIRS=$LLVM_BUILD_DIR/include
LLVM_LIBRARY_DIR=$LLVM_BUILD_DIR/lib
LLVM_SYSPATH=$LLVM_BUILD_DIR
pipinstall-epython

出現3.0.0說明triton已經安裝成功了，裝完triton后一定要安裝Torch，為個人使用的是CUDA 12.1版本，按照下面的命令無腦安裝即可。

pipinstalltorch==2.1.2torchvision==0.16.2torchaudio==2.1.2--index-urlhttps://download.pytorch.org/whl/cu121

NV GPU上triton的安裝和使用其實已經輕車熟路了，接下來，讓我們來探索一下AMD GPU上如何安裝和配置triton。

0x00 軟件安裝

關于triton amd的backend，雖然triton的官方將其作為third-party來進行支持，但是我還是推薦大家使用AMD專門維護的一套triton版本，因為在最開始的官方triton的main分支下，開啟 TRITON_CODEGEN_AMD_HIP_BACKEND=1 沒有正確完成編譯。所以找到了

按照對應的安裝流程進行安裝即可，我推薦使用如下命令進行安裝，親測有效

1.gitclonehttps://github.com/ROCmSoftwarePlatform/triton.git
2.cdtriton
3.gitcheckouttriton-mlir

這里已經準備好了需要編譯的triton，但是triton后端是基于LLVM的，所以要想借助triton去生成可以跑在對應設備上的代碼，我們還需要對LLVM進行編譯，本教程中將會手動編譯LLVM，當然如果你選擇直接編譯好的LLVM也是沒有問題的。關于LLVM，由于triton是基于b1115f8c這個commit id進行開發的，那么我們只需要將LLVM clone下來后，checkout到對應的commit id，然后按照如下完整命令進行編譯即可。

1.gitclonehttps://github.com/llvm/llvm-project
2.gitcheckoutb1115f8c
3.cdllvm-project
4.mkdirbuild
5.cdbuild
6.cmake-GNinja-DCMAKE_BUILD_TYPE=Release-DLLVM_ENABLE_ASSERTIONS=ON../llvm-DLLVM_ENABLE_PROJECTS="mlir;llvm"
7.ninja-j8

等LLVM全部裝好后，就可以去將當前這個LLVM的路徑寫入到你的bashrc下

exportPATH=/home/llvm-project/build/bin:$PATH

然后進入到一開始clone下來的triton目錄下進行如下命令

1.cdtriton
2.vimCMakeLists.txt(option(TRITON_BUILD_PYTHON_MODULE"BuildPythonTritonbindings"ON))
3.mkdirbuild
4.cdbuild
5.cmake..
6.make-j8

在編譯完全正確后，就會在當前的 build 目錄下產生一個 libtriton.so 文件。那么接下來只要將

libtriton.so 文件移動到 triton/python/triton/_C 目錄下，將 triton 的 python 路徑下入 bashrc

exportTRITON_HOME=/home/Documents/compiler/triton
exportPYTHONPATH=$TRITON_HOME/python:${PYTHONPATH}

如果在編譯的過程中出現 goolge test 找不到的情況，按照如下命令進行安裝:

1.gitclonehttps://github.com/google/googletest
2.cdgoogletest
3.cmakeCMakeLists.txt
4.make-j8
5.cp./lib/libgtest*.a/usr/lib
6.cdgoogletest
7.cp–ainclude/gtest/usr/include

如果在編譯的過程中出現 pybind11 找不到的情況，按照如下命令進行按照：

1.pipinstallpytest
2.gitclonehttps://github.com/pybind/pybind11.git
3.cdpybind11
4.mkdirbuild
5.cdbuild
6.cmake..
7.makecheck-j8
8.sudomakeinstal

關于在AMD GPU上的pytorch 一定要去安裝適配 ROCM 版本的 pytorch，由于我的機器使用的是5.6版本的ROCm，所以我的安裝的命令如下，僅供參考:

pip3installtorch==2.1.0torchvision==0.16.0torchaudio==2.1.0--index-url
https://download.pytorch.org/whl/rocm5.6

關于 ROCM 版本可以通過如下命令進行查詢:

dpkg-l|greprocm

這里要記住，pytorch在AMD GPU上的使用和在NV GPU上的使用非常相似，也是用.cuda()來指定變量所在位置。

0x01 GEMM代碼示例

全部編譯好后，就可以通過執行下面的代碼得到對應的 GEMM 在 AMD 顯卡上針對 Triton和 rocBLAS 的 benchmark 了。

importtorch

importtriton
importtriton.languageastl
importsys
importargparse
importpytest

#`triton.jit`'edfunctionscanbeauto-tunedbyusingthe`triton.autotune`decorator,whichconsumes:
#-Alistof`triton.Config`objectsthatdefinedifferentconfigurationsof
#meta-parameters(e.g.,`BLOCK_SIZE_M`)andcompilationoptions(e.g.,`num_warps`)totry
#-Anauto-tuning*key*whosechangeinvalueswilltriggerevaluationofallthe
#providedconfigs
@triton.autotune(
configs=[
triton.Config({'BLOCK_SIZE_M':128,'BLOCK_SIZE_N':256,'BLOCK_SIZE_K':64,'GROUP_SIZE_M':8},num_stages=3,
num_warps=8),
triton.Config({'BLOCK_SIZE_M':64,'BLOCK_SIZE_N':256,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=4,
num_warps=4),
triton.Config({'BLOCK_SIZE_M':128,'BLOCK_SIZE_N':128,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=4,
num_warps=4),
triton.Config({'BLOCK_SIZE_M':128,'BLOCK_SIZE_N':64,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=4,
num_warps=4),
triton.Config({'BLOCK_SIZE_M':64,'BLOCK_SIZE_N':128,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=4,
num_warps=4),
triton.Config({'BLOCK_SIZE_M':128,'BLOCK_SIZE_N':32,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=4,
num_warps=4),
triton.Config({'BLOCK_SIZE_M':64,'BLOCK_SIZE_N':32,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=5,
num_warps=2),
triton.Config({'BLOCK_SIZE_M':32,'BLOCK_SIZE_N':64,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=5,
num_warps=2),
]iftorch.version.hipisNoneelse[
triton.Config({'BLOCK_SIZE_M':128,'BLOCK_SIZE_N':256,'BLOCK_SIZE_K':16,'GROUP_SIZE_M':1,'waves_per_eu':2},
num_warps=4,num_stages=0),
triton.Config({'BLOCK_SIZE_M':256,'BLOCK_SIZE_N':256,'BLOCK_SIZE_K':16,'GROUP_SIZE_M':4,'waves_per_eu':2},
num_warps=8,num_stages=0),
triton.Config({'BLOCK_SIZE_M':128,'BLOCK_SIZE_N':128,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':1,'waves_per_eu':2},
num_warps=8,num_stages=0),
triton.Config({'BLOCK_SIZE_M':64,'BLOCK_SIZE_N':128,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8,'waves_per_eu':3},
num_warps=4,num_stages=0),
triton.Config({'BLOCK_SIZE_M':64,'BLOCK_SIZE_N':64,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':1,'waves_per_eu':8},
num_warps=4,num_stages=0),
],
key=['M','N','K'],
)
@triton.heuristics({
'EVEN_K':lambdaargs:args['K']%args['BLOCK_SIZE_K']==0,
})
@triton.jit
defmatmul_kernel(
#Pointerstomatrices
a_ptr,b_ptr,c_ptr,
#Matrixdimensions
M,N,K,
#Thestridevariablesrepresenthowmuchtoincreasetheptrbywhenmovingby1
#elementinaparticulardimension.E.g.`stride_am`ishowmuchtoincrease`a_ptr`
#bytogettheelementonerowdown(AhasMrows).
stride_am,stride_ak,
stride_bk,stride_bn,
stride_cm,stride_cn,
#Meta-parameters
BLOCK_SIZE_M:tl.constexpr,BLOCK_SIZE_N:tl.constexpr,BLOCK_SIZE_K:tl.constexpr,
EVEN_K:tl.constexpr,
GROUP_SIZE_M:tl.constexpr,
ACTIVATION:tl.constexpr,
):
"""KernelforcomputingthematmulC=AxB.
Ahasshape(M,K),Bhasshape(K,N)andChasshape(M,N)
"""
#-----------------------------------------------------------
#Mapprogramids`pid`totheblockofCitshouldcompute.
#ThisisdoneinagroupedorderingtopromoteL2datareuse.
#Seeabove`L2CacheOptimizations`sectionfordetails.
pid=tl.program_id(axis=0)
num_pid_m=tl.cdiv(M,BLOCK_SIZE_M)
num_pid_n=tl.cdiv(N,BLOCK_SIZE_N)
ifGROUP_SIZE_M==1:
pid_m=pid//num_pid_n
pid_n=pid%num_pid_n
else:
num_pid_in_group=GROUP_SIZE_M*num_pid_n
group_id=pid//num_pid_in_group
first_pid_m=group_id*GROUP_SIZE_M
group_size_m=min(num_pid_m-first_pid_m,GROUP_SIZE_M)
pid_m=first_pid_m+(pid%group_size_m)
pid_n=(pid%num_pid_in_group)//group_size_m

#----------------------------------------------------------
#CreatepointersforthefirstblocksofAandB.
#WewilladvancethispointeraswemoveintheKdirection
#andaccumulate
#`a_ptrs`isablockof[BLOCK_SIZE_M,BLOCK_SIZE_K]pointers
#`b_ptrs`isablockof[BLOCK_SIZE_K,BLOCK_SIZE_N]pointers
#Seeabove`PointerArithmetics`sectionfordetails
offs_k=tl.arange(0,BLOCK_SIZE_K)
offs_am=(pid_m*BLOCK_SIZE_M+tl.arange(0,BLOCK_SIZE_M))%M
offs_bn=(pid_n*BLOCK_SIZE_N+tl.arange(0,BLOCK_SIZE_N))%N
a_ptrs=a_ptr+(offs_am[:,None]*stride_am+offs_k[None,:]*stride_ak)
b_ptrs=b_ptr+(offs_k[:,None]*stride_bk+offs_bn[None,:]*stride_bn)

#-----------------------------------------------------------
#IteratetocomputeablockoftheCmatrix.
#Weaccumulateintoa`[BLOCK_SIZE_M,BLOCK_SIZE_N]`block
#offp32valuesforhigheraccuracy.
#`accumulator`willbeconvertedbacktofp16aftertheloop.
accumulator=tl.zeros((BLOCK_SIZE_M,BLOCK_SIZE_N),dtype=tl.float32)
forkinrange(0,tl.cdiv(K,BLOCK_SIZE_K)):
#LoadthenextblockofAandB,generateamaskbycheckingtheKdimension.
#Ifitisoutofbounds,setitto0.
ifEVEN_K:
a=tl.load(a_ptrs)
b=tl.load(b_ptrs)
else:
a=tl.load(a_ptrs,mask=offs_k[None,:]=0,x,0.01*x)


#%%
#Wecannowcreateaconveniencewrapperfunctionthatonlytakestwoinputtensors,
#and(1)checksanyshapeconstraint;(2)allocatestheoutput;(3)launchestheabovekernel.


defmatmul(a,b,activation=""):
#Checkconstraints.
asserta.shape[1]==b.shape[0],"Incompatibledimensions"
asserta.is_contiguous(),"MatrixAmustbecontiguous"
assertb.is_contiguous(),"MatrixBmustbecontiguous"
M,K=a.shape
K,N=b.shape
#Allocatesoutput.
c=torch.empty((M,N),device=a.device,dtype=a.dtype)
#1Dlaunchkernelwhereeachblockgetsitsownprogram.
grid=lambdaMETA:(triton.cdiv(M,META['BLOCK_SIZE_M'])*triton.cdiv(N,META['BLOCK_SIZE_N']),)
matmul_kernel[grid](
a,b,c,#
M,N,K,#
a.stride(0),a.stride(1),#
b.stride(0),b.stride(1),#
c.stride(0),c.stride(1),#
ACTIVATION=activation#
)
returnc


#%%
#UnitTest
#---------
#
#Wecantestourcustommatrixmultiplicationoperationagainstanativetorchimplementation(i.e.,cuBLAS).
@pytest.mark.parametrize("M,N,K,in_dtype,out_dtype",
[(*shape,in_dtype,out_dtype)
forshapein[(128,256,32),(128,16,32),(32,128,64),
(128,128,64),(64,128,128),(32,128,64),
(64,64,32),(32,32,128),(128,128,64),
(64,128,128),(512,512,512),(1024,1024,1024)]
forin_dtype,out_dtypein[('int8','int8'),
('float16','float16'),
('bfloat16','bfloat16'),
('float16','float32'),
('float32','float32')]]
)
deftest_correctness(M,N,K,in_dtype,out_dtype):
torch.manual_seed(0)
a=torch.randn((M,K),device='cuda',dtype=torch.float16)
b=torch.randn((K,N),device='cuda',dtype=torch.float16)
triton_output=matmul(a,b)
torch_output=torch.matmul(a,b)
print(f"triton_output={triton_output}")
print(f"torch_output={torch_output}")
rtol=0iftorch.version.hipisNoneelse1e-2
iftorch.allclose(triton_output,torch_output,atol=1e-2,rtol=rtol):
print("TritonandTorchmatch")
else:
print("TritonandTorchdiffer")
asserttorch.allclose(triton_output,torch_output,atol=1e-2,rtol=rtol)


#%%
#Benchmark
#---------
#
#SquareMatrixPerformance
#~~~~~~~~~~~~~~~~~~~~~~~~~~
#
#WecannowcomparetheperformanceofourkernelagainstthatofcuBLAS.Herewefocusonsquarematrices,
#butfeelfreetoarrangethisscriptasyouwishtobenchmarkanyothermatrixshape.

globalverbose
verbose=False

@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=['M','N','K'],#Argumentnamestouseasanx-axisfortheplot
x_vals=[
(1024,1024,1024),
(2048,2048,2048),
(4096,4096,4096),
(8192,8192,8192),
(9728,8192,65536)
],#Differentpossiblevaluesfor`x_name`
line_arg='provider',#Argumentnamewhosevaluecorrespondstoadifferentlineintheplot
#Possiblevaluesfor`line_arg`
line_vals=['rocblas','triton'],
#Labelnameforthelines
line_names=["rocBLAS","Triton"],
#Linestyles
styles=[('green','-'),('blue','-')],
ylabel="TFLOPS",#Labelnameforthey-axis
plot_name="matmul-performance",#Namefortheplot,usedalsoasafilenameforsavingtheplot.
args={},
))
defbenchmark(M,N,K,provider):
a=torch.randn((M,K),device='cuda',dtype=torch.float16)
b=torch.randn((K,N),device='cuda',dtype=torch.float16)
quantiles=[0.5,0.2,0.8]
ifprovider=='rocblas':
ms,min_ms,max_ms=triton.testing.do_bench(lambda:torch.matmul(a,b),quantiles=quantiles)
ifprovider=='triton':
ms,min_ms,max_ms=triton.testing.do_bench(lambda:matmul(a,b),quantiles=quantiles)
globalverbose
ifverbose:
print(f'SIZE:{M},{N},{K}Besttuningconfig:({matmul_kernel.get_best_config()})')
perf=lambdams:2*M*N*K*1e-12/(ms*1e-3)
returnperf(ms),perf(max_ms),perf(min_ms)


defparse_args():
parser=argparse.ArgumentParser(
prog="GEMMtutorialexample",
allow_abbrev=False,
)

parser.add_argument("-v",action='store_true',default=False,help="Printoutthebesttuningconfig")
args=parser.parse_args()

returnargs


defmain():
#assigntoaglobalverbosevartoindicatewhetherprint
#besttuningconfig
globalverbose
args=parse_args()
verbose=args.v
benchmark.run(show_plots=True,print_data=True)

if__name__=='__main__':
sys.exit(main())

0x10 GEMM代碼詳細解讀

首先是對于搜索空間的定義，這里

@triton.autotune(
configs=[
triton.Config({'BLOCK_SIZE_M':128,'BLOCK_SIZE_N':256,'BLOCK_SIZE_K':64,'GROUP_SIZE_M':8},num_stages=3,
num_warps=8),
triton.Config({'BLOCK_SIZE_M':64,'BLOCK_SIZE_N':256,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=4,
num_warps=4),
triton.Config({'BLOCK_SIZE_M':128,'BLOCK_SIZE_N':128,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=4,
num_warps=4),
triton.Config({'BLOCK_SIZE_M':128,'BLOCK_SIZE_N':64,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=4,
num_warps=4),
triton.Config({'BLOCK_SIZE_M':64,'BLOCK_SIZE_N':128,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=4,
num_warps=4),
triton.Config({'BLOCK_SIZE_M':128,'BLOCK_SIZE_N':32,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=4,
num_warps=4),
triton.Config({'BLOCK_SIZE_M':64,'BLOCK_SIZE_N':32,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=5,
num_warps=2),
triton.Config({'BLOCK_SIZE_M':32,'BLOCK_SIZE_N':64,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8},num_stages=5,
num_warps=2),
]iftorch.version.hipisNoneelse[
triton.Config({'BLOCK_SIZE_M':128,'BLOCK_SIZE_N':256,'BLOCK_SIZE_K':16,'GROUP_SIZE_M':1,'waves_per_eu':2},
num_warps=4,num_stages=0),
triton.Config({'BLOCK_SIZE_M':256,'BLOCK_SIZE_N':256,'BLOCK_SIZE_K':16,'GROUP_SIZE_M':4,'waves_per_eu':2},
num_warps=8,num_stages=0),
triton.Config({'BLOCK_SIZE_M':128,'BLOCK_SIZE_N':128,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':1,'waves_per_eu':2},
num_warps=8,num_stages=0),
triton.Config({'BLOCK_SIZE_M':64,'BLOCK_SIZE_N':128,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':8,'waves_per_eu':3},
num_warps=4,num_stages=0),
triton.Config({'BLOCK_SIZE_M':64,'BLOCK_SIZE_N':64,'BLOCK_SIZE_K':32,'GROUP_SIZE_M':1,'waves_per_eu':8},
num_warps=4,num_stages=0),
],
key=['M','N','K'],
)

其中的torch.version.hip走的就是AMD GPU所對應的搜索空間，我們看到其對應的可以tuning的knob，有最常規的BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, GROUP_SIZE_M外，還有了一個新的wave_per_eu，我一開始看到這個概念的時候也很陌生，隨后和AMD的技術人員請教了下，總結下來就是：

AMD GPU由計算單元(CU)組成,這相當于NVIDIA GPU上的流處理器(SM)。在每個CU中,有4個SIMD單元(也稱執行引擎或EU)。你可以把SIMD單元看成是一個矢量執行單元,它具有執行計算所需的一定數量的寄存器和ALUs。當你發起一個計算網格時,工作組(相當于NVIDIA GPU上的線程塊)會安排在CU上運行。

在CU中,波前(相當于NVIDIA GPU上的波紋)會安排在SIMD單元上運行。這里提出了occupancy的概念,它表示每個SIMD單元上可同時運行的波前數。這取決于每個波前需要的資源量和每個SIMD單元的資源量。waves_per_eu參數重點關注寄存器使用情況。例如,每個SIMD(EU)有512個寄存器。

如果每個波前需要256個寄存器,那么occupancy為2。但如果我們設置waves_per_eu=3,編譯器會試圖將每個波前的寄存器使用量減少到170,這樣occupancy就可以是3了。但是提高waves_per_eu存在寄存器溢出的風險和性能下降。所以增加waves_per_eu可能會增加occupancy,但不一定能提高性能。

然后是具體的kernel定義，這部分的定義其實和NV GPU上的寫法沒有本質區別

@triton.jit
defmatmul_kernel(
#Pointerstomatrices
a_ptr,b_ptr,c_ptr,
#Matrixdimensions
M,N,K,
#Thestridevariablesrepresenthowmuchtoincreasetheptrbywhenmovingby1
#elementinaparticulardimension.E.g.`stride_am`ishowmuchtoincrease`a_ptr`
#bytogettheelementonerowdown(AhasMrows).
stride_am,stride_ak,
stride_bk,stride_bn,
stride_cm,stride_cn,
#Meta-parameters
BLOCK_SIZE_M:tl.constexpr,BLOCK_SIZE_N:tl.constexpr,BLOCK_SIZE_K:tl.constexpr,
EVEN_K:tl.constexpr,
GROUP_SIZE_M:tl.constexpr,
ACTIVATION:tl.constexpr,
):
"""KernelforcomputingthematmulC=AxB.
Ahasshape(M,K),Bhasshape(K,N)andChasshape(M,N)
"""
#-----------------------------------------------------------
#Mapprogramids`pid`totheblockofCitshouldcompute.
#ThisisdoneinagroupedorderingtopromoteL2datareuse.
#Seeabove`L2CacheOptimizations`sectionfordetails.
pid=tl.program_id(axis=0)
num_pid_m=tl.cdiv(M,BLOCK_SIZE_M)
num_pid_n=tl.cdiv(N,BLOCK_SIZE_N)
ifGROUP_SIZE_M==1:
pid_m=pid//num_pid_n
pid_n=pid%num_pid_n
else:
num_pid_in_group=GROUP_SIZE_M*num_pid_n
group_id=pid//num_pid_in_group
first_pid_m=group_id*GROUP_SIZE_M
group_size_m=min(num_pid_m-first_pid_m,GROUP_SIZE_M)
pid_m=first_pid_m+(pid%group_size_m)
pid_n=(pid%num_pid_in_group)//group_size_m

#----------------------------------------------------------
#CreatepointersforthefirstblocksofAandB.
#WewilladvancethispointeraswemoveintheKdirection
#andaccumulate
#`a_ptrs`isablockof[BLOCK_SIZE_M,BLOCK_SIZE_K]pointers
#`b_ptrs`isablockof[BLOCK_SIZE_K,BLOCK_SIZE_N]pointers
#Seeabove`PointerArithmetics`sectionfordetails
offs_k=tl.arange(0,BLOCK_SIZE_K)
offs_am=(pid_m*BLOCK_SIZE_M+tl.arange(0,BLOCK_SIZE_M))%M
offs_bn=(pid_n*BLOCK_SIZE_N+tl.arange(0,BLOCK_SIZE_N))%N
a_ptrs=a_ptr+(offs_am[:,None]*stride_am+offs_k[None,:]*stride_ak)
b_ptrs=b_ptr+(offs_k[:,None]*stride_bk+offs_bn[None,:]*stride_bn)

#-----------------------------------------------------------
#IteratetocomputeablockoftheCmatrix.
#Weaccumulateintoa`[BLOCK_SIZE_M,BLOCK_SIZE_N]`block
#offp32valuesforhigheraccuracy.
#`accumulator`willbeconvertedbacktofp16aftertheloop.
accumulator=tl.zeros((BLOCK_SIZE_M,BLOCK_SIZE_N),dtype=tl.float32)
forkinrange(0,tl.cdiv(K,BLOCK_SIZE_K)):
#LoadthenextblockofAandB,generateamaskbycheckingtheKdimension.
#Ifitisoutofbounds,setitto0.
ifEVEN_K:
a=tl.load(a_ptrs)
b=tl.load(b_ptrs)
else:
a=tl.load(a_ptrs,mask=offs_k[None,:]

	

	接下來是單元測試，用來說明triton的輸出結果和torch的輸出結果必須是相同的

	
deftest_correctness(M,N,K,in_dtype,out_dtype):
torch.manual_seed(0)
a=torch.randn((M,K),device='cuda',dtype=torch.float16)
b=torch.randn((K,N),device='cuda',dtype=torch.float16)
triton_output=matmul(a,b)
torch_output=torch.matmul(a,b)
print(f"triton_output={triton_output}")
print(f"torch_output={torch_output}")
rtol=0iftorch.version.hipisNoneelse1e-2
iftorch.allclose(triton_output,torch_output,atol=1e-2,rtol=rtol):
print("TritonandTorchmatch")
else:
print("TritonandTorchdiffer")
asserttorch.allclose(triton_output,torch_output,atol=1e-2,rtol=rtol)


	

	接下來你只需要指定好對應的GEMM的尺寸，我們的默認輸入順序還是以M，N，K為主，剩下都是中規中局的操作了。

	
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=['M','N','K'],#Argumentnamestouseasanx-axisfortheplot
x_vals=[
(1024,1024,1024),
(2048,2048,2048),
(4096,4096,4096),
(8192,8192,8192),
(9728,8192,65536)
],#Differentpossiblevaluesfor`x_name`
line_arg='provider',#Argumentnamewhosevaluecorrespondstoadifferentlineintheplot
#Possiblevaluesfor`line_arg`
line_vals=['rocblas','triton'],
#Labelnameforthelines
line_names=["rocBLAS","Triton"],
#Linestyles
styles=[('green','-'),('blue','-')],
ylabel="TFLOPS",#Labelnameforthey-axis
plot_name="matmul-performance",#Namefortheplot,usedalsoasafilenameforsavingtheplot.
args={},
))
defbenchmark(M,N,K,provider):
a=torch.randn((M,K),device='cuda',dtype=torch.float16)
b=torch.randn((K,N),device='cuda',dtype=torch.float16)
quantiles=[0.5,0.2,0.8]
ifprovider=='rocblas':
ms,min_ms,max_ms=triton.testing.do_bench(lambda:torch.matmul(a,b),quantiles=quantiles)
ifprovider=='triton':
ms,min_ms,max_ms=triton.testing.do_bench(lambda:matmul(a,b),quantiles=quantiles)
globalverbose
ifverbose:
print(f'SIZE:{M},{N},{K}Besttuningconfig:({matmul_kernel.get_best_config()})')
perf=lambdams:2*M*N*K*1e-12/(ms*1e-3)
returnperf(ms),perf(max_ms),perf(min_ms)


defparse_args():
parser=argparse.ArgumentParser(
prog="GEMMtutorialexample",
allow_abbrev=False,
)

parser.add_argument("-v",action='store_true',default=False,help="Printoutthebesttuningconfig")
args=parser.parse_args()

returnargs


defmain():
#assigntoaglobalverbosevartoindicatewhetherprint
#besttuningconfig
globalverbose
args=parse_args()
verbose=args.v
benchmark.run(show_plots=True,print_data=True)

if__name__=='__main__':
sys.exit(main())


	

	關于在AMD GPU上更加自動化的GEMM benchmark調優腳本，我們將在后面的章節中來為大家進行解讀。

	




	審核編輯：劉清

聲明：本文內容及配圖由入駐作者撰寫或者入駐合作網站授權轉載。文章觀點僅代表作者本人，不代表電子發燒友網立場。文章及其配圖僅供工程師學習之用，如有內容侵權或者其他違規問題，請聯系本站處理。舉報投訴

amd

amd

+關注

關注
25

文章
5260

瀏覽量
132834
gpu

gpu

+關注

關注
27

文章
4478

瀏覽量
127153
Triton

Triton

+關注

關注
0

文章
16

瀏覽量
6997
python

python

+關注

關注
52

文章
4698

瀏覽量
83609
GPU芯片

GPU芯片

+關注

關注
1

文章
303

瀏覽量
5697
pytorch

pytorch

+關注

關注
2

文章
766

瀏覽量
12880
OpenAI

OpenAI

+關注

關注
9

文章
879

瀏覽量
6018

原文標題：OpenAI/Triton MLIR 第四章: ROCm-triton配置

文章出處：【微信號：GiantPandaCV，微信公眾號：GiantPandaCV】歡迎添加關注！文章轉載請注明出處。

評論

相關推薦

AMD 平臺配置推薦

AMD 平臺配置推薦配件品牌型號 CPU AMD羿龍三核 8450(盒) 內存金士頓 2G DDR2

發表于 01-11 21:03

VMware GPU分配/在GPU 1之前首先使用GPU 0

嗨，我在ESX 6中使用GRID K2。Atm有像GPU0和GPU1之間的負載平衡。如果我在主機上打開2 K240q，則第一個VM被分配給GPU

發表于 09-10 17:14

可以在vGPU配置文件上運行TensorFlow嗎？

進行了一些測試，但是這個過程耗盡了GPU，并且對于需要GPU的其他容器來說也很慢。1- 我可以在vGPU配置文件上運行TensorFlow嗎

發表于 09-18 16:35

單個GPU上不支持混合配置文件嗎？

有人可以確認單個GPU上不支持混合配置文件嗎？我知道這在K1 / K2 GPU中不受支持，但聽到有關M6 / M60 GPU的一些混合消息。

發表于 10-08 14:07

AMD迎頭猛追Intel 全球首發7nm GPU很威風！

`處理器大廠美商超微（AMD）日前發布全球首款7奈米制程資料中心繪圖處理器（GPU），為新世代人工智能（AI）、云端運算與高效能運算挹注動能。AMD Radeon Instinct MI60

發表于 11-20 11:35

在ubuntu18上是怎樣去安裝vscode的

code_1.51.0-1604600753_amd64.deb配置vscode(安裝插件)C/C++，這個肯定是必須的。C/C++ Snippets，即 C/C++重用代碼塊。C/C++ Advanced Lint,即 C/

發表于 11-04 06:35

在Ubuntu上使用Nvidia GPU訓練模型

問題最近在Ubuntu上使用Nvidia GPU訓練模型的時候，沒有問題，過一會再訓練出現非?？D，使用nvidia-smi查看發現，顯示GPU的風扇和電源報錯：解決方案自動風扇控制在

發表于 01-03 08:24

在RK3399開發板上運行Arm mali GPU驅動

、Ubuntu 這種發行版上默認搭配的 Gnome 或者 KDE，這種沒有 GPU 支持，所有的圖像合成渲染都要通過 CPU 來運算，是很難跑流暢的。所以如果你對圖形顯示功能比較看重，在選開發板的時候

發表于 07-27 15:43

在RK3399上運行開源的GPU驅動

的。所以如果你對圖形顯示功能比較看重，在選開發板的時候一定要查閱與之搭配的主控 SOC 是否帶有 GPU。也有很多人被卡在了第二個關卡，SOC 上搭配了強勁的 GPU、比如 RK339

發表于 10-20 17:44

無法在AMD Ryzen CPU上運行OpenVINO trade怎么解決？

在 AMD Ryzen CPU 上運行OpenVINO?推理。收到錯誤消息： libva error: vaGetDriverNameByIndex() failed with unknown libva error, dr

發表于 08-15 06:46

GTC2022大會黃仁勛：NVIDIA Triton是AI部署的“中央車站”

GTC2022大會黃仁勛：NVIDIA Triton是AI部署的“中央車站”，支持在每一代NVIDIA GPU、x86和ARM CPU上部署模型。

發表于 03-23 16:23 ?1333次閱讀

NVIDIA Triton推理服務器簡化人工智能推理

GKE 的 Triton 推理服務器應用程序是一個 helm chart 部署程序，可自動安裝和配置 Triton ，以便在具有 NVIDIA GP

發表于 04-08 16:43 ?1948次閱讀

NVIDIA Triton 系列文章（5）：安裝服務器軟件

在前一篇文章已經帶著讀者創建一個 Triton 的推理模型倉，現在只要安裝好服務器端與用戶端軟件，就能進行基本的測試與體驗。為了簡化過程，我們使用 NVIDIA Jetson AGX Orin

發表于 11-22 19:50 ?997次閱讀

NVIDIA Triton 系列文章（6）：安裝用戶端軟件

服務器執行推理計算的任務。由于用戶端的功能是在向服務器提出推理需求，本身并不參與計算，因此不用考慮設備性能或者是否裝載 GPU 設備，即便是一臺最基本的 Windows 上網本都能使用，只要安裝合適的用戶端軟件就可以。為了適應更多種使用場景，

發表于 11-29 19:20 ?892次閱讀

什么是Triton-shared？Triton-shared的安裝和使用

經過前面幾章關于triton在nv gpu上調優的講解，我們這章開始來看看triton的一個third_party庫，該庫是為了讓triton去支持更多其他的backend。該項目的地

發表于 12-19 09:47 ?562次閱讀

精選推薦
更多

文章

資料

帖子

如何移植FFmpeg

ElfBoard
4小時前

147 閱讀

基于Arm Cortex-CM85內核的RA8D1作為控制器通過MIPI DSI實現LVGL顯示

瑞薩MCU小百科
4小時前

235 閱讀

便攜式直流接地故障查找儀如何使用——每日了解電力知識

武漢摩恩
8小時前

279 閱讀

重磅！英特爾發布intel3制程至強6能效核處理器，賦能數據中心能效升級

章鷹觀察
9小時前

348 閱讀

Node-RED實現電表數據采集并上云

億佰特物聯網應用專家
11小時前

245 閱讀

視頻監控系統中矩陣切換-字符疊加實現方案

yezi888
168 KB

免費

97下載

雙足行走機器人步行原理是什么？步態軌跡規劃的詳細分析資料概述

尹鵬飛
0.42 MB

免費

101下載

Gimpanel Fcitx輸入法的GTK+前端

舊念
0.02 MB

免費

0下載

rigger Deis云部署工具

djelje
0.05 MB

免費

0下載

VOoM vim代碼輪轂插件

楊麗
0.15 MB

免費

1下載

I.MX6ULL-飛凌 ElfBoard ELF1板卡 - 如何在Ubuntu中編譯OpenCV庫(X86架構）

jf_25331175
19小時前

203 閱讀

鴻蒙原生應用元服務開發-設備管理USB服務開發場景與接口

李洋水蛟龍
19小時前

145 閱讀

嵌入式學習-飛凌ElfBoard ELF 1板卡 - 如何在Ubuntu中編譯OpenCV庫

jf_25331175
19小時前

270 閱讀

【留言有禮】佳節至，分享“粽”要記憶，傳遞溫馨祝福，贏取精美禮品！

ElecFans小喇叭
1天前

410 閱讀

無刷電機用的單電阻采樣的FOC，拿天線貼著電機線或者貼近采樣芯片，電機就會停轉

剛剛回國弄
1天前

611 閱讀

推薦專欄
更多

華秋（原“華強聚豐”）：

電子發燒友

華秋開發

華秋電路(原"華強PCB")

華秋商城(原"華強芯城")

華秋智造

My ElecFans

APP
網站地圖

設計技術

可編程邏輯

電源/新能源

MEMS/傳感技術

測量儀表

嵌入式技術

制造/封裝

模擬技術

RF/無線

接口/總線/驅動

處理器/DSP

EDA/IC設計

存儲技術

光電顯示

EMC/EMI設計

連接器

行業應用

LEDs

汽車電子

音視頻及家電

通信網絡

醫療電子

人工智能

虛擬現實

可穿戴設備

機器人

安全設備/系統

軍用/航空電子

移動通信

工業控制

便攜設備

觸控感測

物聯網

智能電網

區塊鏈

新科技

特色內容

專欄推薦

學院

設計資源

設計技術

電子百科

電子視頻

元器件知識

工具箱

VIP會員

最新技術文章

社區

小組

論壇

問答

評測試用

企業服務

產品

資料

文章

方案

企業

供應鏈服務

硬件開發

華秋電路

華秋商城

華秋智造

nextPCB

BOM配單

媒體服務

網站廣告

在線研討會

活動策劃

新聞發布

新品發布

小測驗

設計大賽

華秋

關于我們

投資關系

新聞動態

加入我們

聯系我們

舉報投訴

社交網絡

微博

移動端

發燒友APP

硬聲APP

WAP

聯系我們

廣告合作

王婉珠：wangwanzhu@elecfans.com

內容合作

黃晶晶：huangjingjing@elecfans.com

內容合作（海外）

張迎輝：mikezhang@elecfans.com

供應鏈服務 PCB/IC/PCBA

江良華：lanhu@huaqiu.com

投資合作

曾海銀：zenghaiyin@huaqiu.com

社區合作

劉勇：liuyong@huaqiu.com

關注我們的微信

下載發燒友APP

電子發燒友觀察

電子工程師社區

1-32層PCB打樣·中小批量

元器件現貨·全球代購·SmartBOM

SMT貼片·PCBA加工

PCB Manufacturer

華秋簡介

企業動態

聯系我們

企業文化

企業宣傳片

加入我們

版權所有 ? 湖南華秋數字科技有限公司
電子發燒友 （電路圖） 湘公網安備43011202000918 電信與信息服務業務經營許可證：合字B2-20210191 工商網監湘ICP備 2023018690 號

亚洲欧美日韩精品久久_久久精品AⅤ无码中文_日本中文字幕有码在线播放_亚洲视频高清不卡在线观看