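A First Look at TensorRT-LLM (Part 1): Running LLaMA

Preface

TensorRT-LLM has been officially out for about half a month, but I had not found time to try it. This weekend I finally had a chance to run it.

The closed beta already required CUDA 12.x, and the official release still does. The main reason is that the CUBINs (precompiled binary kernels) that TensorRT-LLM depends on were built against CUDA 12.x, so the only way to run it is to update the driver.

So, to get TensorRT-LLM running quickly, I recommend upgrading nvidia-driver to 535.xxx and running everything in Docker, which saves you from wrestling with the environment. If you want to modify the source code, you can do that inside the container as well.

In theory, replacing that precompiled part of the original code would let you use a different CUDA version. (The batch manager is merely closed source and should be unrelated to the CUDA version; the main issue is the FMA module. In addition, the TensorRT that TensorRT-LLM depends on has CUDA 11.x builds, and the triton-inference-server used together with inflight_batcher_llm has no hard dependency on CUDA 12.x either.)

(Figure: the precompiled parts of TensorRT-LLM)

With the environment requirements covered, let's set things up.

Setting up the runtime environment and libraries

First pull the image. The GPU driver on the host must be version 535 or later: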

docker pull nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

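This image was released only a few days ago. It contains the full environment needed to run TensorRT-LLM (TensorRT, MPI, nvcc, the NCCL libraries, and so on), so you do not have to configure anything yourself.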
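Since the image depends on a host driver of 535 or later, it may be worth checking the driver before starting the container. The following is a minimal sketch rather than part of the original workflow: the helper names are hypothetical, and it assumes `nvidia-smi` is on the host's PATH.

```python
import re
import subprocess

MIN_DRIVER_MAJOR = 535  # minimum driver branch stated in this article


def driver_major(version: str) -> int:
    """Return the major component of a driver string such as '535.113.01'."""
    match = re.match(r"\s*(\d+)\.", version)
    if match is None:
        raise ValueError(f"unrecognized driver version: {version!r}")
    return int(match.group(1))


def driver_is_new_enough(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True when the driver's major version meets the minimum."""
    return driver_major(version) >= minimum


def query_driver_version() -> str:
    """Ask nvidia-smi for the driver version of the first GPU (host only)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()[0].strip()
```

On a machine with an NVIDIA GPU, `driver_is_new_enough(query_driver_version())` returns whether the installed driver satisfies the requirement.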

After the image has been pulled, start a container from it:

docker run -it -d --cap-add=SYS_PTRACE --cap-add=SYS_ADMIN --security-opt seccomp=unconfined --gpus=all --shm-size=16g --privileged --ulimit memlock=-1 --name=develop nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash

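All of the following steps are performed inside this container.

Building TensorRT-LLM

First get the git repository. This image contains only the libraries needed at runtime, so models still have to be compiled yourself (TensorRT-LLM depends on TensorRT, and anyone who has used TensorRT knows an engine must be built). So first build TensorRT-LLM: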

# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs install
git lfs pull

Then build from inside the repository:

python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt

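There are usually no environment problems, since this Docker image already contains every required package. When build_wheel runs, it pip-installs a few required packages as laid out in the script, then runs cmake and make to compile: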

..
adding 'tensorrt_llm/tools/plugin_gen/templates/functional.py.tpl'
adding 'tensorrt_llm/tools/plugin_gen/templates/plugin.cpp.tpl'
adding 'tensorrt_llm/tools/plugin_gen/templates/plugin.h.tpl'
adding 'tensorrt_llm/tools/plugin_gen/templates/plugin_common.cpp'
adding 'tensorrt_llm/tools/plugin_gen/templates/plugin_common.h'
adding 'tensorrt_llm/tools/plugin_gen/templates/tritonPlugins.cpp.tpl'
adding 'tensorrt_llm-0.5.0.dist-info/LICENSE'
adding 'tensorrt_llm-0.5.0.dist-info/METADATA'
adding 'tensorrt_llm-0.5.0.dist-info/WHEEL'
adding 'tensorrt_llm-0.5.0.dist-info/top_level.txt'
adding 'tensorrt_llm-0.5.0.dist-info/zip-safe'
adding 'tensorrt_llm-0.5.0.dist-info/RECORD'
removing build/bdist.linux-x86_64/wheel
Successfully built tensorrt_llm-0.5.0-py3-none-any.whl

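Then simply pip install tensorrt_llm-0.5.0-py3-none-any.whl.

Running

First compile the model. I have not downloaded any new models recently, so I will use the older LLaMA as the example. Other LLMs (ChatGLM, Qwen, and so on) work the same way: as long as TensorRT-LLM supports the model, the build and run steps are identical. Just download the model you want to test from Hugging Face.

Here I run: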

# Replace --model_dir with the path to your own LLM.
# --use_inflight_batching enables in-flight batching.
python /work/code/TensorRT-LLM/examples/llama/build.py \
                --model_dir /work/models/GPT/LLAMA/llama-7b-hf \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --use_inflight_batching \
                --output_dir /work/trtModel/llama/1-gpu

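What follows is the TensorRT compilation, that is, the engine build. Because plugins are used, it is fairly quick. I only used a single A4000 here, so I did not set world_size, which defaults to 1. There are many details in this step that I will cover in later posts.

Once the engine is built, the output is written to /work/trtModel/llama/1-gpu, which is used in the following steps.

Then run the following commands: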

cd tensorrtllm_backend
mkdir triton_model_repo

# Copy the template model folders into the model repository
cp -r all_models/inflight_batcher_llm/* triton_model_repo/

# Copy the engine files just generated in `/work/trtModel/llama/1-gpu` into the template model folder
cp /work/trtModel/llama/1-gpu/* triton_model_repo/tensorrt_llm/1
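The copy step above can also be scripted, which makes it easier to fail loudly when the engine directory is empty. This is only a sketch: `stage_engine` is a hypothetical helper name, and the `tensorrt_llm/1` layout is taken from the commands above.

```python
import shutil
from pathlib import Path
from typing import List


def stage_engine(engine_dir: str, model_repo: str) -> List[str]:
    """Copy every file in engine_dir into <model_repo>/tensorrt_llm/1.

    Returns the sorted list of copied file names.
    """
    src = Path(engine_dir)
    dst = Path(model_repo) / "tensorrt_llm" / "1"
    dst.mkdir(parents=True, exist_ok=True)

    copied = []
    for item in sorted(src.iterdir()):
        if item.is_file():
            shutil.copy2(item, dst / item.name)
            copied.append(item.name)

    if not copied:
        # An empty engine directory usually means the build step did not finish.
        raise FileNotFoundError(f"no engine files found in {src}")
    return copied
```

With the paths used in this article, the call would be `stage_engine("/work/trtModel/llama/1-gpu", "triton_model_repo")`.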

Once that is set up, go into tensorrtllm_backend and run:

python3 scripts/launch_triton_server.py --world_size=1 --model_repo=triton_model_repo

If all goes well, the output looks like this:

root@6aaab84e59c0:/work/code/tensorrtllm_backend# I1105 14:16:58.286836 2561098 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7ffb76000000' with size 268435456
I1105 14:16:58.286973 2561098 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1105 14:16:58.288120 2561098 model_lifecycle.cc:461] loading: tensorrt_llm:1
I1105 14:16:58.288135 2561098 model_lifecycle.cc:461] loading: preprocessing:1
I1105 14:16:58.288142 2561098 model_lifecycle.cc:461] loading: postprocessing:1
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1105 14:16:58.392915 2561098 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1105 14:16:58.392979 2561098 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
I1105 14:16:58.732165 2561098 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I1105 14:16:59.383255 2561098 model_lifecycle.cc:818] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 12856 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13144, GPU 13111 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 13146, GPU 13121 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12852, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13164, GPU 14363 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 13164, GPU 14371 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13198, GPU 14391 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 13198, GPU 14401 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] Using 2878 tokens in paged KV cache.
I1105 14:17:17.299293 2561098 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm'
I1105 14:17:17.303661 2561098 model_lifecycle.cc:461] loading: ensemble:1
I1105 14:17:17.305897 2561098 model_lifecycle.cc:818] successfully loaded 'ensemble'
I1105 14:17:17.306051 2561098 server.cc:592] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1105 14:17:17.306401 2561098 server.cc:619] 
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                               |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-com |
|             |                                                                 | pute-capability":"6.000000","default-max-batch-size":"4"}}                                           |
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-com |
|             |                                                                 | pute-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}}       |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------+

I1105 14:17:17.307053 2561098 server.cc:662] 
+----------------+---------+--------+
| Model          | Version | Status |
+----------------+---------+--------+
| ensemble       | 1       | READY  |
| postprocessing | 1       | READY  |
| preprocessing  | 1       | READY  |
| tensorrt_llm   | 1       | READY  |
+----------------+---------+--------+

I1105 14:17:17.393318 2561098 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA RTX A4000
I1105 14:17:17.393534 2561098 metrics.cc:710] Collecting CPU metrics
I1105 14:17:17.394550 2561098 tritonserver.cc:2458] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                              |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                             |
| server_version                   | 2.39.0                                                                                                                                             |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_ |
|                                  | memory binary_tensor_data parameters statistics trace logging                                                                                      |
| model_repository_path[0]         | /work/triton_models/inflight_batcher_llm                                                                                                           |
| model_control_mode               | MODE_NONE                                                                                                                                          |
| strict_model_config              | 1                                                                                                                                                  |
| rate_limit                       | OFF                                                                                                                                                |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                          |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                           |
| min_supported_compute_capability | 6.0                                                                                                                                                |
| strict_readiness                 | 1                                                                                                                                                  |
| exit_timeout                     | 30                                                                                                                                                 |
| cache_enabled                    | 0                                                                                                                                                  |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

I1105 14:17:17.423479 2561098 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1105 14:17:17.424418 2561098 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000

At this point triton-inference-server is up and running, with TensorRT-LLM as its backend.
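Rather than watching the log for the READY table, readiness can also be polled over HTTP. The sketch below assumes the server is reachable on localhost:8000 as in the log above and uses Triton's standard `/v2/health/ready` endpoint; the function names are hypothetical.

```python
import urllib.error
import urllib.request


def ready_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the URL of Triton's readiness endpoint."""
    return f"http://{host}:{port}/v2/health/ready"


def is_ready(host: str = "localhost", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if the server answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: not ready yet.
        return False
```

Calling `is_ready()` in a loop with a short sleep is one way to wait for the server before sending requests.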

You can see that the FP16 version of LLaMA-7B uses the following amount of GPU memory:

+---------------------------------------------------------------------------------------+
Sun Nov  5 14:20:46 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A4000               Off | 00000000:01:00.0 Off |                  Off |
| 41%   34C    P8              16W / 140W |  15855MiB / 16376MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
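The numbers in the log can be roughly sanity-checked by hand. The sketch below estimates the FP16 weight size and the paged KV cache size; the LLaMA-7B dimensions (hidden size 4096, 32 layers, MLP size 11008, vocabulary 32000) are assumptions taken from the public model configuration, not from this article, and the KV cache is assumed to be stored in FP16.

```python
# LLaMA-7B shape: assumed from the public model config, not stated in the article.
HIDDEN = 4096
LAYERS = 32
INTERMEDIATE = 11008
VOCAB = 32000

BYTES_FP16 = 2
MIB = 1024 ** 2


def llama_param_count(hidden: int, layers: int, intermediate: int, vocab: int) -> int:
    """Count the parameters of a LLaMA-style decoder."""
    attention = 4 * hidden * hidden      # q, k, v, o projections
    mlp = 3 * hidden * intermediate      # gate, up, down projections
    norms = 2 * hidden                   # two RMSNorm weights per layer
    embeddings = 2 * vocab * hidden      # input embedding + output head
    return layers * (attention + mlp + norms) + embeddings + hidden  # + final norm


def kv_cache_bytes(tokens: int, hidden: int, layers: int,
                   bytes_per_value: int = BYTES_FP16) -> int:
    """Bytes needed to hold K and V for `tokens` tokens across all layers."""
    return tokens * 2 * layers * hidden * bytes_per_value


params = llama_param_count(HIDDEN, LAYERS, INTERMEDIATE, VOCAB)
weights_mib = params * BYTES_FP16 / MIB
kv_mib = kv_cache_bytes(2878, HIDDEN, LAYERS) / MIB  # 2878 tokens, from the log

print(f"parameters:   {params:,}")
print(f"FP16 weights: {weights_mib:.1f} MiB")
print(f"KV cache:     {kv_mib:.1f} MiB")
```

Under these assumptions the weights come to about 12852 MiB, which lines up with the engine size reported in the log, and the 2878-token KV cache adds roughly 1.4 GiB. The rest of the 15855 MiB shown by nvidia-smi would be CUDA context, library workspaces, and activation buffers.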

Client

Now let's send a request, starting with the HTTP interface:

# Run
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'

# Response
{"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" ?  What is machine learning? Machine learning is a subfield of computer science that focuses on the development of algorithms that can learn"}
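The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl call above: the URL, model name, and JSON fields are copied from it, while the function names are hypothetical.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:8000/v2/models/ensemble/generate"


def build_generate_payload(prompt: str, max_tokens: int = 20) -> dict:
    """Build the JSON body used by the curl example above."""
    return {
        "text_input": prompt,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
    }


def generate(prompt: str, max_tokens: int = 20,
             url: str = GENERATE_URL, timeout: float = 60.0) -> str:
    """POST the prompt to the ensemble model and return its text_output field."""
    body = json.dumps(build_generate_payload(prompt, max_tokens)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["text_output"]
```

With the server running, `generate("What is machine learning?")` returns the `text_output` string from the response shown above.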

Triton currently does not support SSE. To stream, you can use the gRPC protocol, for which an official client is also provided. First install the Triton client:

pip install tritonclient[all]

Then run:

python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer_dir /work/models/GPT/LLAMA/llama-7b-hf --tokenizer_type llama --streaming

After sending the request you can see the result coming back one token at a time, which is the same effect as ChatGPT 3.5 printing its answer word by word:

... 
[29953]
[29941]
[511]
[450]
[315]
[4664]
[457]
[310]
output_ids =  [[0, 19298, 297, 6641, 29899, 23027, 3444, 29892, 1105, 7598, 16370, 408, 263, 14547, 297, 3681, 1434, 8401, 304, 4517, 297, 29871, 29896, 29947, 29946, 29955, 29889, 940, 3796, 472, 278, 23933, 5977, 322, 278, 7021, 16923, 297, 29258, 265, 1434, 8718, 670, 1914, 27144, 297, 29871, 29896, 29947, 29945, 29896, 29889, 940, 471, 263, 29323, 261, 310, 278, 671, 310, 21837, 7984, 292, 322, 471, 278, 937, 304, 671, 263, 10489, 380, 994, 29889, 940, 471, 884, 263, 410, 29880, 928, 9227, 322, 670, 8277, 5134, 450, 315, 4664, 457, 310, 3444, 313, 29896, 29947, 29945, 29896, 511, 450, 315, 4664, 457, 310, 12730, 313, 29896, 29947, 29945, 29946, 511, 450, 315, 4664, 457, 310, 13616, 313, 29896, 29947, 29945, 29945, 511, 450, 315, 4664, 457, 310, 9556, 313, 29896, 29947, 29945, 29955, 511, 450, 315, 4664, 457, 310, 17362, 313, 29896, 29947, 29945, 29947, 511, 450, 315, 4664, 457, 310, 12710, 313, 29896, 29947, 29945, 29929, 511, 450, 315, 4664, 457, 310, 14198, 653, 313, 29896, 29947, 29953, 29900, 511, 450, 315, 4664, 457, 310, 28806, 313, 29896, 29947, 29953, 29896, 511, 450, 315, 4664, 457, 310, 27440, 313, 29896, 29947, 29953, 29906, 511, 450, 315, 4664, 457, 310, 24506, 313, 29896, 29947, 29953, 29941, 511, 450, 315, 4664, 457, 310]]
Input: Born in north-east France, Soyer trained as a
Output:  chef in Paris before moving to London in 1 847. He worked at the Reform Club and the Royal Hotel in Brighton before opening his own restaurant in 1 851 . He was a pioneer of the use of steam cooking and was the first to use a gas stove. He was also a prolific writer and his books included The Cuisine of France (1 851 ), The Cuisine of Italy (1 854), The Cuisine of Spain (1 855), The Cuisine of Germany (1 857), The Cuisine of Austria (1 858), The Cuisine of Russia (1 859), The Cuisine of Hungary (1 860), The Cuisine of Switzerland (1 861 ), The Cuisine of Norway (1 862), The Cuisine of Sweden (1863), The Cuisine of

Because in-flight batching is enabled, several requests can actually be sent at the same time, as long as each one uses a different request_id:

# user 1
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer_dir /work/models/GPT/LLAMA/llama-7b-hf --tokenizer_type llama --streaming --request_id 1
# user 2
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer_dir /work/models/GPT/LLAMA/llama-7b-hf --tokenizer_type llama --streaming --request_id 2
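To actually fire several clients at once instead of starting them by hand, the two commands above can be launched in parallel. The following is a sketch only: it reuses the script path, tokenizer directory, and flags exactly as they appear in this article, and the helper names are hypothetical. It should be run from the tensorrtllm_backend directory.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List

CLIENT = "inflight_batcher_llm/client/inflight_batcher_llm_client.py"
TOKENIZER_DIR = "/work/models/GPT/LLAMA/llama-7b-hf"


def client_command(request_id: int, output_len: int = 200) -> List[str]:
    """Build the streaming client command line for one request ID."""
    return [
        "python3", CLIENT,
        "--request-output-len", str(output_len),
        "--tokenizer_dir", TOKENIZER_DIR,
        "--tokenizer_type", "llama",
        "--streaming",
        "--request_id", str(request_id),
    ]


def run_clients(request_ids: Iterable[int]) -> List[int]:
    """Run one client per request ID concurrently and return their exit codes."""
    ids = list(request_ids)
    if len(set(ids)) != len(ids):
        # Concurrent requests must not share a request ID.
        raise ValueError("request IDs must be unique")
    with ThreadPoolExecutor(max_workers=max(1, len(ids))) as pool:
        return list(pool.map(
            lambda rid: subprocess.run(client_command(rid)).returncode, ids
        ))
```

For example, `run_clients([1, 2])` starts the "user 1" and "user 2" requests at the same time, so the server can batch them together.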

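That completes a quick walkthrough of the whole TensorRT-LLM workflow.

Recommendations

I strongly recommend using Docker. Life is short.

In our real-world use, vLLM is not slow in large-batch scenarios and can saturate GPU utilization. TensorRT-LLM is faster than vLLM on some models and slower on others; each has its strengths and weaknesses.

TensorRT-LLM's defining trait is that it builds on TensorRT: the faster TensorRT evolves and the more powerful the features it supports, the more powerful TensorRT-LLM becomes. In terms of flexibility, I feel vLLM and TensorRT-LLM are on par. Large-model architectures are all much alike, and TensorRT-LLM does not even go through the ONNX parser, so when it comes to adding new models, building them quickly in Python is about equally efficient in both.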
