Environment:

aupxtx@aupxtx:~$ python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 2.3.0+rocm5.7
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 5.7.31921-d1770ee1b

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.0.2 24012 af27734ed982b52a9f1be0f035ac91726fc697e4)
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: Radeon RX 7900 XTX (gfx1100)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 5.7.31921
MIOpen runtime version: 2.20.0
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   48 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          16
On-line CPU(s) list:             0-15
Vendor ID:                       AuthenticAMD
Model name:                      AMD Ryzen 7 7800X3D 8-Core Processor
CPU family:                      25
Model:                           97
Thread(s) per core:              2
Core(s) per socket:              8
Socket(s):                       1
Stepping:                        2
Frequency boost:                 enabled
CPU max MHz:                     5049.0229
CPU min MHz:                     3000.0000
BogoMIPS:                        8399.69
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                  AMD-V
L1d cache:                       256 KiB (8 instances)
L1i cache:                       256 KiB (8 instances)
L2 cache:                        8 MiB (8 instances)
L3 cache:                        96 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-15
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] pytorch-triton-rocm==2.3.0
[pip3] torch==2.3.0+rocm5.7
[pip3] torchaudio==2.3.0+rocm5.7
[pip3] torchvision==0.18.0+rocm5.7

Opt Log

1. Identify poor performance on the 7900 XTX

Tested with llama-7b-relu.powerinfer.gguf and llama-7b-relu.q4.powerinfer.gguf

7900 result without quant:

llama_print_timings:        load time =    1408.45 ms
llama_print_timings:      sample time =      12.29 ms /   128 runs   (    0.10 ms per token, 10419.21 tokens per second)
llama_print_timings: prompt eval time =      74.36 ms /     5 tokens (   14.87 ms per token,    67.24 tokens per second)
llama_print_timings:        eval time =    4499.95 ms /   127 runs   (   35.43 ms per token,    28.22 tokens per second)
llama_print_timings:       total time =    4601.82 ms

4090 result without quant:

llama_print_timings:        load time =    6310.50 ms
llama_print_timings:      sample time =      14.17 ms /   128 runs   (    0.11 ms per token,  9034.44 tokens per second)
llama_print_timings: prompt eval time =      18.50 ms /     5 tokens (    3.70 ms per token,   270.20 tokens per second)
llama_print_timings:        eval time =    1609.02 ms /   127 runs   (   12.67 ms per token,    78.93 tokens per second)
llama_print_timings:       total time =    1660.44 ms

7900 result with Q4 quant:

llama_print_timings:        load time =     495.59 ms
llama_print_timings:      sample time =      12.43 ms /   128 runs   (    0.10 ms per token, 10301.81 tokens per second)
llama_print_timings: prompt eval time =      80.97 ms /     5 tokens (   16.19 ms per token,    61.75 tokens per second)
llama_print_timings:        eval time =    2831.35 ms /   127 runs   (   22.29 ms per token,    44.85 tokens per second)
llama_print_timings:       total time =    2935.92 ms

4090 result with Q4 quant:

llama_print_timings:        load time =     428.45 ms
llama_print_timings:      sample time =      15.35 ms /  128 runs  (  0.12 ms per token, 8337.68 tokens per second)
llama_print_timings: prompt eval time =      14.47 ms /   5 tokens (  2.89 ms per token,  345.57 tokens per second)
llama_print_timings:        eval time =    898.83 ms /  127 runs  (  7.08 ms per token,  141.29 tokens per second)
llama_print_timings:       total time =    947.09 ms

                           Llama + 4090   Llama + 7900   PowerInfer + 4090   PowerInfer + 7900
Performance                79 tokens/s    61 tokens/s    78.93 tokens/s      28.22 tokens/s
Performance with Q4 quant  140 tokens/s   110 tokens/s   141.29 tokens/s     44.85 tokens/s

2. Profile with ROCprofv2

$ rocprofv2 -h
ROCProfilerV2 Run Script Usage:
-h   | --help                 For showing this message
--list-counters               For showing all available counters for the current GPUs
-m                            For providing an absolute path of a custom metrics file
--basenames                   For Truncating the kernel names
--hip-api                     For Collecting HIP API Traces
--hip-activity | --hip-trace  For Collecting HIP API Activities Traces
--hsa-api                     For Collecting HSA API Traces
--hsa-activity | --hsa-trace  For Collecting HSA API Activities Traces
--roctx-trace                 For Collecting ROCTx Traces
--kernel-trace                For Collecting Kernel dispatch Traces
--sys-trace                   For Collecting HIP and HSA APIs and their Activities Traces along ROCTX and Kernel Dispatch traces

        #usage e.g: rocprofv2 --[hip-trace|hsa-trace|roctx-trace|kernel-trace|sys-trace]  <executable>

--plugin  PLUGIN_NAME         For enabling a plugin (cli/file/perfetto/att/ctf)
        # usage(file/perfetto/ctf) e.g: rocprofv2 -i pmc.txt --plugin [file/perfetto/ctf] -d out_dir <executable>
        # usage(att): rocprofv2 <rocprofv2_params> --plugin att <ISA_file> <att_parameters> <executable>
        # use "rocprofv2 --plugin att --help" for ATT-specific parameters help.

--plugin-version  <1|2>       For selecting the version for the plugin (1/2)
        # 1 - Legacy output format, 2 - New output format (default)

-i   | --input                For adding counters file path (every line in the text file represents a counter)
        # usage: rocprofv2 -i pmc.txt -d <executable>

-o   | --output-file          For the output file name
        # usage e.g:(with current dir): rocprofv2 --hip-trace -o <file_name> <executable>
        # usage e.g:(with custom dir):  rocprofv2 --hip-trace -d <out_dir> -o <file_name> <executable>

-d   | --output-directory     For adding output path where the output files will be saved
        # usage e.g:(with custom dir):  rocprofv2 --hip-trace -d <out_dir> <executable>

-fi  | --flush-interval       For adding a flush interval in milliseconds, every "flush interval" the buffers will be flushed
        # usage e.g:  rocprofv2 --hip-trace -fi 1000 <executable>

-tp  | --trace-period        Specifies a trace period in milliseconds, with format "-tp <DELAY>:<ACTIVE_TIME>:<LOOP_RESET_TIME>".
        # usage e.g:  rocprofv2 --hip-trace -tp 1000:2000:4000 <executable>

I used this script for profiling:

#!/bin/bash

rm -rf build
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release
rocprofv2 -d ./profile/ --hip-trace --hip-api --plugin perfetto ./build/bin/main -m ./ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf --ignore-eos -n 256 --seed 0 --top-k 1 --reset-gpu-index -t 8 -p "Once"

First I got this result:

Opt1

Opt2

We can see that hipMemcpyAsync API calls and CopyHostToDevice events dominate the trace.

However, theoretical analysis and the load-time log say otherwise:

total VRAM used: 22793.02 MB (model: 14195.52 MB, context: 341.50 MB)

VRAM used < 7900 XTX VRAM (24 GB)

The model and the context should therefore have been loaded onto the GPU once at startup, rather than triggering this many memcpy calls later during inference.

So we need to confirm whether loading actually happens as we expect.

3. Backend detection

Referring to the tensor data structure, I inserted the following snippet after the end of the load phase to print each tensor's backend device, so I could check whether the data had been loaded onto the GPU as expected:

if (src->backend == GGML_BACKEND_CPU) {
    printf("%s src_CPU\n", src->name);
} else if (src->backend == GGML_BACKEND_GPU || src->backend == GGML_BACKEND_GPU_SPLIT) {
    printf("%s src_GPU\n", src->name);
}

if (dst->backend == GGML_BACKEND_CPU) {
    printf("%s dst_CPU\n", dst->name);
} else if (dst->backend == GGML_BACKEND_GPU || dst->backend == GGML_BACKEND_GPU_SPLIT) {
    printf("%s dst_GPU\n", dst->name);
}
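
For convenience, the same check can be folded into a small helper (a sketch that assumes the ggml_tensor fields and GGML_BACKEND_* values used in this branch):

// Hypothetical helper, not part of the project: prints which backend a tensor lives on.
static void print_tensor_backend(const struct ggml_tensor * t, const char * tag) {
    const char * where = "UNKNOWN";
    if (t->backend == GGML_BACKEND_CPU) {
        where = "CPU";
    } else if (t->backend == GGML_BACKEND_GPU || t->backend == GGML_BACKEND_GPU_SPLIT) {
        where = "GPU";
    }
    printf("%s %s_%s\n", t->name, tag, where);
}

// usage at the same insertion point as above:
// print_tensor_backend(src, "src");
// print_tensor_backend(dst, "dst");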

And the result:

src0_GPU
src1_GPU
dst_GPU
src0_GPU
src1_GPU
dst_GPU
src0_GPU
src1_GPU
dst_GPU
src0_GPU
src1_GPU
dst_GPU
src0_GPU
src1_GPU
dst_GPU
src0_GPU
src1_GPU
dst_GPU
src0_GPU
src1_GPU
dst_GPU
src0_GPU
src1_GPU
dst_GPU
src0_GPU
src1_GPU
result_output dst_CPU

This shows that these tensors are indeed loaded onto the GPU (the final dst_CPU is the result_output tensor, which is kept on the CPU to output the result).

Besides, I also analyzed the computation graph:

7900
=== GRAPH ===
n_nodes = 1220
 -   0: [  4096,     1,     1]         GET_ROWS                                         inp_embd   (  1) cpu =   0.000 /   0.000 ms, wall =   0.006 /   0.006 ms
 -   1: [  4096,     1,     1]         RMS_NORM                                           norm-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.034 /   0.034 ms
 -   2: [  4096,     1,     1]              MUL                                      attn_norm-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.021 /   0.021 ms
 -   3: [  4096,     1,     1]          MUL_MAT                                           Vcur-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.097 /   0.097 ms
 -   4: [  4096,     1,     1]          RESHAPE                                Vcur-0 (reshaped)   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms
 -   5: [     1,  4096,     1]        TRANSPOSE                                        v_cur_t-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms
 -   6: [     1,  4096,     1]             VIEW                                   v_cache_view-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms
 -   7: [     1,  4096,     1]              CPY               v_cache_view-0 (copy of v_cur_t-0)   (  1) cpu =   0.000 /   0.000 ms, wall =   0.021 /   0.021 ms
 -   8: [    32,   128,    32]             VIEW                                              v-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms
 -   9: [  4096,     1,     1]          MUL_MAT                                           Kcur-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.088 /   0.088 ms
 -  10: [   128,    32,     1]          RESHAPE                                Kcur-0 (reshaped)   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms
 -  11: [   128,    32,     1]             ROPE                                           Kcur-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.021 /   0.021 ms
 -  12: [  4096,     1,     1]             VIEW                                   k_cache_view-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms
 -  13: [  4096,     1,     1]              CPY                  k_cache_view-0 (copy of Kcur-0)   (  1) cpu =   0.000 /   0.000 ms, wall =   0.019 /   0.019 ms
 -  14: [   128,    32,    32]             VIEW                                              k-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms
 -  15: [  4096,     1,     1]          MUL_MAT                                           Qcur-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.102 /   0.102 ms
 -  16: [   128,    32,     1]          RESHAPE                                Qcur-0 (reshaped)   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms
 -  17: [   128,    32,     1]             ROPE                                           Qcur-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.023 /   0.023 ms
 -  18: [   128,     1,    32]          PERMUTE                                              q-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms
 -  19: [    32,     1,    32]          MUL_MAT                                             kq-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.029 /   0.029 ms
 -  20: [    32,     1,    32]            SCALE                                      kq_scaled-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.029 /   0.029 ms
 -  21: [    32,     1,    32]              ADD                                      kq_masked-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.021 /   0.021 ms
 -  22: [    32,     1,    32]         SOFT_MAX                                    kq_soft_max-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.022 /   0.022 ms
 -  23: [   128,     1,    32]          MUL_MAT                                            kqv-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.030 /   0.030 ms
 -  24: [   128,    32,     1]          PERMUTE                                     kqv_merged-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms
 -  25: [  4096,     1,     1]             CONT                                kqv_merged_cont-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.021 /   0.021 ms
 -  26: [  4096,     1,     1]          MUL_MAT                                        kqv_out-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.090 /   0.090 ms
 -  27: [  4096,     1,     1]              ADD                                        ffn_inp-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.026 /   0.026 ms
 -  28: [  4096,     1,     1]         RMS_NORM                                           norm-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.022 /   0.022 ms
 -  29: [  4096,     1,     1]              MUL                                       ffn_norm-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.029 /   0.029 ms
 -  30: [  1024,     1,     1]          MUL_MAT                                 mlp_pre_hidden-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.051 /   0.051 ms
 -  31: [  1024,     1,     1]            UNARY                                   mlp_pre_relu-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.019 /   0.019 ms
 -  32: [ 11008,     1,     1]          MUL_MAT                                    mlp_pre_out-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.117 /   0.117 ms
 -  33: [ 11008,     1,     1]   MUL_MAT_SPARSE                                ffn_gate_sparse-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.061 /   0.061 ms
 -  34: [ 11008,     1,     1]            UNARY                                   ffn_gate_act-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.019 /   0.019 ms
 -  35: [ 11008,     1,     1]   MUL_MAT_SPARSE                                  ffn_up_sparse-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.059 /   0.059 ms
 -  36: [ 11008,     1,     1]              MUL                                   ffn_gate_par-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.021 /   0.021 ms
 -  37: [  4096,     1,     1]             AXPY                                ffn_down_sparse-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.322 /   0.322 ms
 -  38: [  4096,     1,     1]              ADD                                          l_out-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.021 /   0.021 ms
 -  39: [  4096,     1,     1]         RMS_NORM                                           norm-1   (  1) cpu =   0.000 /   0.000 ms, wall =   0.020 /   0.020 ms
 -  40: [  4096,     1,     1]              MUL                                      attn_norm-1   (  1) cpu =   0.000 /   0.000 ms, wall =   0.019 /   0.019 ms
 
...

4090
=== GRAPH ===
n_nodes = 1220
 -   0: [  4096,     1,     1]         GET_ROWS                                         inp_embd   (  1) cpu =   0.000 /   0.000 ms, wall =   0.006 /   0.006 ms
 -   1: [  4096,     1,     1]         RMS_NORM                                           norm-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.014 /   0.014 ms
 -   2: [  4096,     1,     1]              MUL                                      attn_norm-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.006 /   0.006 ms
 -   3: [  4096,     1,     1]          MUL_MAT                                           Vcur-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.052 /   0.052 ms
 -   4: [  4096,     1,     1]          RESHAPE                                Vcur-0 (reshaped)   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms
 -   5: [     1,  4096,     1]        TRANSPOSE                                        v_cur_t-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms
 -   6: [     1,  4096,     1]             VIEW                                   v_cache_view-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms
 -   7: [     1,  4096,     1]              CPY               v_cache_view-0 (copy of v_cur_t-0)   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms
 -   8: [    32,   128,    32]             VIEW                                              v-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms
 -   9: [  4096,     1,     1]          MUL_MAT                                           Kcur-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.048 /   0.048 ms
 -  10: [   128,    32,     1]          RESHAPE                                Kcur-0 (reshaped)   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms
 -  11: [   128,    32,     1]             ROPE                                           Kcur-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms
 -  12: [  4096,     1,     1]             VIEW                                   k_cache_view-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms
 -  13: [  4096,     1,     1]              CPY                  k_cache_view-0 (copy of Kcur-0)   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms
 -  14: [   128,    32,    32]             VIEW                                              k-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms
 -  15: [  4096,     1,     1]          MUL_MAT                                           Qcur-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.048 /   0.048 ms
 -  16: [   128,    32,     1]          RESHAPE                                Qcur-0 (reshaped)   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms
 -  17: [   128,    32,     1]             ROPE                                           Qcur-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms
 -  18: [   128,     1,    32]          PERMUTE                                              q-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms
 -  19: [    32,     1,    32]          MUL_MAT                                             kq-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.013 /   0.013 ms
 -  20: [    32,     1,    32]            SCALE                                      kq_scaled-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.008 /   0.008 ms
 -  21: [    32,     1,    32]              ADD                                      kq_masked-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms
 -  22: [    32,     1,    32]         SOFT_MAX                                    kq_soft_max-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.006 /   0.006 ms
 -  23: [   128,     1,    32]          MUL_MAT                                            kqv-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.011 /   0.011 ms
 -  24: [   128,    32,     1]          PERMUTE                                     kqv_merged-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms
 -  25: [  4096,     1,     1]             CONT                                kqv_merged_cont-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms
 -  26: [  4096,     1,     1]          MUL_MAT                                        kqv_out-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.047 /   0.047 ms
 -  27: [  4096,     1,     1]              ADD                                        ffn_inp-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.007 ms
 -  28: [  4096,     1,     1]         RMS_NORM                                           norm-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms
 -  29: [  4096,     1,     1]              MUL                                       ffn_norm-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.007 ms
 -  30: [  1024,     1,     1]          MUL_MAT                                 mlp_pre_hidden-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.022 /   0.022 ms
 -  31: [  1024,     1,     1]            UNARY                                   mlp_pre_relu-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms
 -  32: [ 11008,     1,     1]          MUL_MAT                                    mlp_pre_out-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.035 /   0.035 ms
 -  33: [ 11008,     1,     1]   MUL_MAT_SPARSE                                ffn_gate_sparse-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.025 /   0.025 ms
 -  34: [ 11008,     1,     1]            UNARY                                   ffn_gate_act-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms
 -  35: [ 11008,     1,     1]   MUL_MAT_SPARSE                                  ffn_up_sparse-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.023 /   0.023 ms
 -  36: [ 11008,     1,     1]              MUL                                   ffn_gate_par-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms
 -  37: [  4096,     1,     1]             AXPY                                ffn_down_sparse-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.040 /   0.040 ms
 -  38: [  4096,     1,     1]              ADD                                          l_out-0   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms
 -  39: [  4096,     1,     1]         RMS_NORM                                           norm-1   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms
 -  40: [  4096,     1,     1]              MUL                                      attn_norm-1   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms

...

We can see that the CPU time is always 0, which shows that almost nothing runs on the CPU.

So, for the next step, we need to figure out where the many unexpected memcpy calls are issued (down to the exact code line).

4. Code traceability

First, from the profiling result we can see that all the unexpected memcpys are host-to-device. I reviewed the CUDA/HIP memcpy APIs:

hipMemcpy

I searched (Ctrl+F) for every memcpy call and every hipMemcpyKind usage to locate the candidate code lines.

Then I inserted some otherwise useless HIP API calls near them as markers:

// Use cudaDeviceSynchronize to mark this piece of code for further profiling
static void ggml_backend_cuda_set_tensor_async(ggml_backend_t backend, ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
    GGML_ASSERT(offset + size <= ggml_nbytes(tensor) && "tensor write out of bounds");
    GGML_ASSERT(tensor->data != NULL && "tensor not allocated");
    GGML_ASSERT(tensor->backend == GGML_BACKEND_GPU);
    cudaDeviceSynchronize();
    CUDA_CHECK(cudaMemcpyAsync((char *)tensor->data + offset, data, size, cudaMemcpyHostToDevice, g_cudaStreams[g_main_device][0]));
    cudaDeviceSynchronize();
    UNUSED(backend);
}
// Use malloc & free to mark this piece of code for further profiling
void* dummy_ptr = nullptr;
cudaMalloc(&dummy_ptr, 1);

if (dummy_ptr != nullptr) {
    CUDA_CHECK(cudaFree(dummy_ptr));
    dummy_ptr = nullptr;
}

const char * x = src_ptr + i1_low*nb1 + i2*nb2 + i3*nb3;
if (nb0 == ts && nb1 == ts*ne0/bs) {
    return cudaMemcpyAsync(dst_ptr, x, i1_diff*nb1, kind, stream);
} else if (nb0 == ts) {
    return cudaMemcpy2DAsync(dst_ptr, ts*ne0/bs, x, nb1, ts*ne0/bs, i1_diff, kind, stream);
} else {
    for (int64_t i1 = 0; i1 < i1_diff; i1++) {
        const void * rx = (const void *) ((const char *) x + i1*nb1);
        void * rd = (void *) (dst_ptr + i1*ts*ne0/bs);
        cudaError_t r = cudaMemcpy2DAsync(rd, ts/bs, rx, nb0, ts/bs, ne0, kind, stream);
        if (r != cudaSuccess) return r;
    }
    return cudaSuccess;
}

Then I profiled again and got the following results:

backend_cuda_set_with_devicesync.pftrace:

backend_cuda_set_with_devicesync

backend_cuda_set_without_devicesync.pftrace:

backend_cuda_set_without_devicesync

cpy_tensor_2d_with_mallocfree.pftrace:

cpy_tensor_2d_with_mallocfree

The hipFree marker shows up in this trace, which means this function issued the unexpected memcpys:

static cudaError_t ggml_cuda_cpy_tensor_2d(
    void * dst, const struct ggml_tensor * src, int64_t i3, int64_t i2, int64_t i1_low, int64_t i1_high, cudaStream_t stream) {

    cudaMemcpyKind kind;
    char * src_ptr;
    if (src->backend == GGML_BACKEND_CPU) {
        kind = cudaMemcpyHostToDevice;
        src_ptr = (char *) src->data;
    } else if (src->backend == GGML_BACKEND_GPU || src->backend == GGML_BACKEND_GPU_SPLIT) {
        GGML_ASSERT(src->backend != GGML_BACKEND_GPU_SPLIT || (i1_low == 0 && i1_high == src->ne[1]));
        kind = cudaMemcpyDeviceToDevice;
        ggml_tensor_extra_gpu * extra = (ggml_tensor_extra_gpu *) src->extra;
        int id;
        CUDA_CHECK(cudaGetDevice(&id));
        src_ptr = (char *) extra->data_device[id];
    } else {
        GGML_ASSERT(false);
    }
    char * dst_ptr = (char *) dst;

    const int64_t ne0 = src->ne[0];
    const int64_t nb0 = src->nb[0];
    const int64_t nb1 = src->nb[1];
    const int64_t nb2 = src->nb[2];
    const int64_t nb3 = src->nb[3];
    const enum ggml_type type = src->type;
    const int64_t ts = ggml_type_size(type);
    const int64_t bs = ggml_blck_size(type);
    int64_t i1_diff = i1_high - i1_low;

    const char * x = src_ptr + i1_low*nb1 + i2*nb2 + i3*nb3;
    if (nb0 == ts && nb1 == ts*ne0/bs) {
        return cudaMemcpyAsync(dst_ptr, x, i1_diff*nb1, kind, stream);
    } else if (nb0 == ts) {
        return cudaMemcpy2DAsync(dst_ptr, ts*ne0/bs, x, nb1, ts*ne0/bs, i1_diff, kind, stream);
    } else {
        for (int64_t i1 = 0; i1 < i1_diff; i1++) {
            const void * rx = (const void *) ((const char *) x + i1*nb1);
            void * rd = (void *) (dst_ptr + i1*ts*ne0/bs);
            // pretend the row is a matrix with cols=1
            cudaError_t r = cudaMemcpy2DAsync(rd, ts/bs, rx, nb0, ts/bs, ne0, kind, stream);
            if (r != cudaSuccess) return r;
        }
        return cudaSuccess;
    }
}
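
For the next step, the malloc/free marker pair from above is attached to individual call sites of this function, so each caller can be distinguished in the trace; for example, the src1 path (quoted in full further below) becomes, as a sketch:

// Marker placed directly before one suspected call site, so a matching
// hipMalloc/hipFree pair appears next to it in the Perfetto trace.
void * dummy_ptr = nullptr;
cudaMalloc(&dummy_ptr, 1);
if (dummy_ptr != nullptr) {
    CUDA_CHECK(cudaFree(dummy_ptr));
    dummy_ptr = nullptr;
}
CUDA_CHECK(ggml_cuda_cpy_tensor_2d(src1_ddf, src1, 0, 0, 0, nrows1, main_stream));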

To get a finer-grained result, I profiled the call sites of this function separately (the src0 and src1 copies in both the mul_mat path and the cuda_op_flatten path):

matmul_src1_cpytensor_mallocfree.pftrace:

matmul_src1_cpytensor_mallocfree

No hipFree marker appears, so this call site did not cause the unexpected memcpys.

matmul_src0_cpytensor_mallocfree.pftrace:

matmul_src0_cpytensor_mallocfree

No hipFree marker appears, so this call site did not cause the unexpected memcpys.

src0_cuda_op_flatten_mallocfree.pftrace:

src0_cuda_op_flatten_mallocfree

No hipFree marker appears, so this call site did not cause the unexpected memcpys.

src1_cuda_op_flatten_mallocfree.pftrace:

src1_cuda_op_flatten_mallocfree

The hipFree marker shows up in this trace, which means this piece of code caused the unexpected memcpys:

if (use_src1 && !src1_stays_on_host) {
    if (src1_on_device) {
        src1_ddf = (float *) src1_extra->data_device[g_main_device];
    } else {
        src1_ddf = (float *) ggml_cuda_pool_malloc(ggml_nbytes(src1), &src1_asf);
        CUDA_CHECK(ggml_cuda_cpy_tensor_2d(src1_ddf, src1, 0, 0, 0, nrows1, main_stream));
    }
}

5. Eliminating the unexpected memcpy

This exception is mainly caused by the ffn_norm weight, and it is a bug in the project itself. PowerInfer automatically selects which parameters to place on the GPU based on the available VRAM, but the norm weights were stored in a class whose members are always kept on the CPU regardless of that decision. As a result, every use of these parameters triggers a host-to-device memcpy. The fix is to move them out of that class so that they are also uploaded to the GPU during the initial load, instead of being copied from the CPU every time they are used.
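
The exact patch depends on the loader internals, but the idea is roughly the following sketch (hypothetical condition and field names; ggml_cuda_transform_tensor is assumed to be the same one-time upload path used for the other offloaded weights):

// Sketch of the idea behind the fix, not the actual PowerInfer patch:
// treat the norm weights like any other offloadable layer weight, so they are
// uploaded once at load time instead of being copied host-to-device on every use.
if (offload_layer_to_gpu) {                       // hypothetical: the existing VRAM-budget decision
    layer.ffn_norm->backend = GGML_BACKEND_GPU;   // previously forced to GGML_BACKEND_CPU by its storage class
    ggml_cuda_transform_tensor(layer.ffn_norm->data, layer.ffn_norm);   // one-time upload
}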

After the bug fix, we got the following result:

Memcpy_afterfix

Unfortunately, the performance problem remains unresolved. However, we have now pinned down the essence of the problem: the operators themselves. From the HIP stream we can see that kernel execution is continuous, which proves that the poor performance does not come from the MemcpyAsync calls we just verified, but from poorly performing kernel implementations. To solve this, we must find the most expensive operators and optimize them.

6. Operator optimization

Operator optimization is heavy work, so we first collected per-kernel runtime statistics:

                                                 Name  Calls  Percentage  TotalDurationMs  AverageMs
0   Cijk_Alik_Bljk_HB_MT32x32x32_MI16x16x16x1_SN_1...   5888   56.977682       492.232145   0.083599
1   void dequantize_mul_mat_vec<1, 1, &(convert_f1...   1452   14.746984       127.399699   0.087740
2   Cijk_Alik_Bljk_HB_MT32x32x32_MI16x16x16x1_SN_1...   1288    9.408184        81.277628   0.063103
3   void dequantize_block<1, 1, &(convert_f32(void...  10370    3.461931        29.907737   0.002884
4   void dequantize_block<1, 1, &(convert_f16(void...  10370    2.742961        23.696530   0.002285
5   void rms_norm_f32<1024>(float const*, float*, ...   2990    1.774599        15.330825   0.005127
6   add_f32(float const*, float const*, float*, in...   4416    1.403729        12.126866   0.002746
7   void rope<float, true>(float const*, float*, i...   2944    1.329179        11.482825   0.003900
8   Cijk_Alik_Bljk_HB_MT32x32x32_MI16x16x16x1_SN_1...    184    1.019050         8.803607   0.047845
9   void cpy_f32_f16<&(cpy_1_f32_f16(char const*, ...   2944    1.014060         8.760496   0.002975
10  Cijk_Alik_Bljk_HB_MT32x32x16_MI16x16x16x1_SN_1...   1472    1.012366         8.745866   0.005941
11  mul_f32(float const*, float const*, float*, in...   2990    0.999206         8.632172   0.002887
12  Cijk_Alik_Bljk_HB_MT32x32x32_MI16x16x16x1_SN_1...   1408    0.798503         6.898291   0.004899
13  Cijk_Alik_Bljk_HB_MT64x64x32_MI16x16x16x1_SN_1...     52    0.734857         6.348453   0.122085
14  soft_max_f32(float const*, float*, int) [clone...   1472    0.642171         5.547740   0.003768
15  void cpy_f32_f16<&(cpy_1_f32_f32(char const*, ...   1472    0.444675         3.841566   0.002609
16  scale_f32(float const*, float*, float, int) [c...   1472    0.429947         3.714326   0.002523
17    relu_f32(float const*, float*, int) [clone .kd]   1472    0.391271         3.380202   0.002296
18                  __amd_rocclr_fillBufferAligned.kd      6    0.328644         2.839166   0.473194
19  Cijk_Alik_Bljk_HB_MT64x64x32_MI16x16x16x1_SN_1...      2    0.153582         1.326803   0.663401
20  Cijk_Alik_Bljk_HB_MT64x64x32_MI16x16x16x1_SN_1...     12    0.123283         1.065044   0.088753
21  Cijk_Alik_Bljk_HB_GB_MT32x32x32_MI16x16x16x1_S...     64    0.040560         0.350400   0.005475
22  k_compute_batched_ptrs(__half const*, __half c...     64    0.022577         0.195040   0.003047

The Cijk_Alik_Bljk… kernels are internal rocBLAS/hipBLAS calls generated by AMD's Tensile library (https://github.com/ROCm/Tensile) and are relatively hard to optimize directly. So we first targeted the mul_mat_vec operator used in the attention part, specifically mul_mat_vec_q4_0_q8_1_cuda():

static void mul_mat_vec_q4_0_q8_1_cuda(const void * vx, const void * vy, float * dst, const int ncols, const int nrows, cudaStream_t stream) {
    GGML_ASSERT(ncols % QK4_0 == 0);
    const int block_num_y = (nrows + GGML_CUDA_MMV_Y - 1) / GGML_CUDA_MMV_Y;
    const dim3 block_nums(block_num_y, 1, 1);
    const dim3 block_dims(WARP_SIZE, GGML_CUDA_MMV_Y, 1);
    mul_mat_vec_q<QK4_0, QI4_0, block_q4_0, VDR_Q4_0_Q8_1_MMVQ, vec_dot_q4_0_q8_1>
        <<<block_nums, block_dims, 0, stream>>>(vx, vy, dst, ncols, nrows);
}

static void mul_mat_vec_q4_0_q8_1_cuda(const void * vx, const void * vy, float * dst, const int ncols, const int nrows, cudaStream_t stream, const int *lst, const float *idx) {
    GGML_ASSERT(ncols % QK4_0 == 0);
    const int block_num_y = (nrows + GGML_CUDA_MMV_Y - 1) / GGML_CUDA_MMV_Y;
    const dim3 block_nums(block_num_y, 1, 1);
    const dim3 block_dims(WARP_SIZE, GGML_CUDA_MMV_Y, 1);
    mul_mat_vec_q<QK4_0, QI4_0, block_q4_0, VDR_Q4_0_Q8_1_MMVQ, vec_dot_q4_0_q8_1>
        <<<block_nums, block_dims, 0, stream>>>(vx, vy, dst, ncols, nrows, lst, idx);
}

Then we inserted timing code on both the 4090 and the 7900, in both the llama.cpp and the PowerInfer projects, to compare the results:

static void ggml_cuda_op_mul_mat(
    const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, ggml_cuda_op_mul_mat_t op,
    const bool convert_src1_to_q8_1) {

    uint64_t t1 = ggml_time_us();
    uint64_t t2 = ggml_time_us();
    uint64_t t3 = ggml_time_us();
    uint64_t t4 = ggml_time_us();
  
 		...
      
      
    cudaDeviceSynchronize();
    t2 = ggml_time_us();
  
    // do the computation
    op(src0, src1, dst, src0_dd_i, src1_ddf_i, src1_ddq_i, dst_dd_i,
       row_low[id], row_high[id], src1_ncols, src1_padded_col_size, stream);
    CUDA_CHECK(cudaGetLastError());
  
    cudaDeviceSynchronize();
    t3 = ggml_time_us();
  
  	...
      
    cudaDeviceSynchronize();
    t4 = ggml_time_us();
    if (dst->ne[0] == 4096 && dst->ne[1] == 1 && dst->src[0]->ne[0] == 4096) {
        printf("transfer src to gpu %ld us\n", t2-t1); 
        printf("computation on gpu %ld us\n", t3-t2);
        printf("write back %ld us\n", t4-t3);
    }
}

The results are as follows (both on Q4 models: mistral-7b with llama.cpp, bamboo-base-7b with PowerInfer):

                    Llama + 4090   Llama + 7900   PowerInfer + 4090   PowerInfer + 7900
Transfer to GPU     4 us           25 us          0 us                17-71 us
Computation on GPU  14 us          31 us          7 us                33-308 us
Write back          1 us           0 us           14 us               0-1 us

This is consistent with our earlier results: the 7900 is stable and fast in the llama.cpp project, with no significant fluctuation in running time, whereas its computation time in the PowerInfer project fluctuates severely, which is the root cause of the poor performance.

The next step is to analyze why these significant fluctuations occur on the 7900 with PowerInfer.

We suspected that the most likely cause is the difference in the parameters passed to each op invocation, so we ran the op in a loop with identical parameters to verify this hypothesis:

static void ggml_cuda_op_mul_mat(
    const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, ggml_cuda_op_mul_mat_t op,
    const bool convert_src1_to_q8_1) {

    uint64_t t1 = ggml_time_us();
    uint64_t t2 = ggml_time_us();
    uint64_t t3 = ggml_time_us();
    uint64_t t4 = ggml_time_us();
  
 		...
      
  
    // do the computation
    for(int i = 0; i < 7; i++)
    {
        printf("round %d\n",i);

        cudaDeviceSynchronize();
        t2 = ggml_time_us();

        op(src0, src1, dst, src0_dd_i, src1_ddf_i, src1_ddq_i, dst_dd_i,
        row_low[id], row_high[id], src1_ncols, src1_padded_col_size, stream);
        CUDA_CHECK(cudaGetLastError());

        cudaDeviceSynchronize();
        t3 = ggml_time_us();

        printf("computation on gpu %ld us, src0_name: %s, src1_name: %s\n", t3-t2, src0->name, src1->name);
    }
  
  	...
      
}

We got the following result:

round 0
computation on gpu 118 us, src0_name: blk.0.attn_v.weight, src1_name: attn_norm-0
round 1
computation on gpu 35 us, src0_name: blk.0.attn_v.weight, src1_name: attn_norm-0
round 2
computation on gpu 34 us, src0_name: blk.0.attn_v.weight, src1_name: attn_norm-0
round 3
computation on gpu 34 us, src0_name: blk.0.attn_v.weight, src1_name: attn_norm-0
round 4
computation on gpu 34 us, src0_name: blk.0.attn_v.weight, src1_name: attn_norm-0
round 5
computation on gpu 34 us, src0_name: blk.0.attn_v.weight, src1_name: attn_norm-0
round 6
computation on gpu 35 us, src0_name: blk.0.attn_v.weight, src1_name: attn_norm-0
round 0
computation on gpu 97 us, src0_name: blk.0.attn_k.weight, src1_name: attn_norm-0
round 1
computation on gpu 34 us, src0_name: blk.0.attn_k.weight, src1_name: attn_norm-0
round 2
computation on gpu 34 us, src0_name: blk.0.attn_k.weight, src1_name: attn_norm-0
round 3
computation on gpu 34 us, src0_name: blk.0.attn_k.weight, src1_name: attn_norm-0
round 4
computation on gpu 34 us, src0_name: blk.0.attn_k.weight, src1_name: attn_norm-0
round 5
computation on gpu 33 us, src0_name: blk.0.attn_k.weight, src1_name: attn_norm-0
round 6
computation on gpu 34 us, src0_name: blk.0.attn_k.weight, src1_name: attn_norm-0
round 0
computation on gpu 122 us, src0_name: blk.0.attn_q.weight, src1_name: attn_norm-0
round 1
computation on gpu 26 us, src0_name: blk.0.attn_q.weight, src1_name: attn_norm-0
round 2
computation on gpu 26 us, src0_name: blk.0.attn_q.weight, src1_name: attn_norm-0
round 3
computation on gpu 25 us, src0_name: blk.0.attn_q.weight, src1_name: attn_norm-0
round 4
computation on gpu 25 us, src0_name: blk.0.attn_q.weight, src1_name: attn_norm-0
round 5
computation on gpu 25 us, src0_name: blk.0.attn_q.weight, src1_name: attn_norm-0
round 6
computation on gpu 25 us, src0_name: blk.0.attn_q.weight, src1_name: attn_norm-0
round 0
computation on gpu 67 us, src0_name: blk.0.attn_output.weight, src1_name: kqv_merged_cont-0
round 1
computation on gpu 25 us, src0_name: blk.0.attn_output.weight, src1_name: kqv_merged_cont-0
round 2
computation on gpu 24 us, src0_name: blk.0.attn_output.weight, src1_name: kqv_merged_cont-0
round 3
computation on gpu 25 us, src0_name: blk.0.attn_output.weight, src1_name: kqv_merged_cont-0
round 4
computation on gpu 25 us, src0_name: blk.0.attn_output.weight, src1_name: kqv_merged_cont-0
round 5
computation on gpu 25 us, src0_name: blk.0.attn_output.weight, src1_name: kqv_merged_cont-0
round 6
computation on gpu 25 us, src0_name: blk.0.attn_output.weight, src1_name: kqv_merged_cont-0
round 0
computation on gpu 34 us, src0_name: blk.0.fc1.weight, src1_name: ffn_inp-0
round 1
computation on gpu 21 us, src0_name: blk.0.fc1.weight, src1_name: ffn_inp-0
round 2
computation on gpu 21 us, src0_name: blk.0.fc1.weight, src1_name: ffn_inp-0
round 3
computation on gpu 21 us, src0_name: blk.0.fc1.weight, src1_name: ffn_inp-0
round 4
computation on gpu 21 us, src0_name: blk.0.fc1.weight, src1_name: ffn_inp-0
round 5
computation on gpu 21 us, src0_name: blk.0.fc1.weight, src1_name: ffn_inp-0
round 6
computation on gpu 20 us, src0_name: blk.0.fc1.weight, src1_name: ffn_inp-0
round 0
computation on gpu 49 us, src0_name: blk.0.fc2.weight, src1_name: mlp_pre_relu-0
round 1
computation on gpu 23 us, src0_name: blk.0.fc2.weight, src1_name: mlp_pre_relu-0
round 2
computation on gpu 23 us, src0_name: blk.0.fc2.weight, src1_name: mlp_pre_relu-0
round 3
computation on gpu 23 us, src0_name: blk.0.fc2.weight, src1_name: mlp_pre_relu-0
round 4
computation on gpu 23 us, src0_name: blk.0.fc2.weight, src1_name: mlp_pre_relu-0
round 5
computation on gpu 23 us, src0_name: blk.0.fc2.weight, src1_name: mlp_pre_relu-0
round 6
computation on gpu 23 us, src0_name: blk.0.fc2.weight, src1_name: mlp_pre_relu-0
round 0
computation on gpu 48 us, src0_name: blk.0.ffn_gate.weight, src1_name: ffn_norm-0
round 1
computation on gpu 27 us, src0_name: blk.0.ffn_gate.weight, src1_name: ffn_norm-0
round 2
computation on gpu 27 us, src0_name: blk.0.ffn_gate.weight, src1_name: ffn_norm-0
round 3
computation on gpu 29 us, src0_name: blk.0.ffn_gate.weight, src1_name: ffn_norm-0
round 4
computation on gpu 27 us, src0_name: blk.0.ffn_gate.weight, src1_name: ffn_norm-0
round 5
computation on gpu 27 us, src0_name: blk.0.ffn_gate.weight, src1_name: ffn_norm-0
round 6
computation on gpu 27 us, src0_name: blk.0.ffn_gate.weight, src1_name: ffn_norm-0
round 0
computation on gpu 44 us, src0_name: blk.0.ffn_up.weight, src1_name: ffn_norm-0
round 1
computation on gpu 29 us, src0_name: blk.0.ffn_up.weight, src1_name: ffn_norm-0
round 2
computation on gpu 28 us, src0_name: blk.0.ffn_up.weight, src1_name: ffn_norm-0
round 3
computation on gpu 27 us, src0_name: blk.0.ffn_up.weight, src1_name: ffn_norm-0
round 4
computation on gpu 28 us, src0_name: blk.0.ffn_up.weight, src1_name: ffn_norm-0
round 5
computation on gpu 28 us, src0_name: blk.0.ffn_up.weight, src1_name: ffn_norm-0
round 6
computation on gpu 27 us, src0_name: blk.0.ffn_up.weight, src1_name: ffn_norm-0
round 0
computation on gpu 310 us, src0_name: blk.0.ffn_down_t.weight, src1_name: ffn_gate_par-0
round 1
computation on gpu 302 us, src0_name: blk.0.ffn_down_t.weight, src1_name: ffn_gate_par-0
round 2
computation on gpu 323 us, src0_name: blk.0.ffn_down_t.weight, src1_name: ffn_gate_par-0
round 3
computation on gpu 330 us, src0_name: blk.0.ffn_down_t.weight, src1_name: ffn_gate_par-0
round 4
computation on gpu 281 us, src0_name: blk.0.ffn_down_t.weight, src1_name: ffn_gate_par-0
round 5
computation on gpu 326 us, src0_name: blk.0.ffn_down_t.weight, src1_name: ffn_gate_par-0
round 6
computation on gpu 285 us, src0_name: blk.0.ffn_down_t.weight, src1_name: ffn_gate_par-0
computation on gpu 35 us, src0_name: blk.0.attn_v.weight, src1_name: attn_norm-0
computation on gpu 34 us, src0_name: blk.0.attn_k.weight, src1_name: attn_norm-0
computation on gpu 26 us, src0_name: blk.0.attn_q.weight, src1_name: attn_norm-0
computation on gpu 25 us, src0_name: blk.0.attn_output.weight, src1_name: kqv_merged_cont-0
computation on gpu 21 us, src0_name: blk.0.fc1.weight, src1_name: ffn_inp-0
computation on gpu 23 us, src0_name: blk.0.fc2.weight, src1_name: mlp_pre_relu-0
computation on gpu 27 us, src0_name: blk.0.ffn_gate.weight, src1_name: ffn_norm-0
computation on gpu 29 us, src0_name: blk.0.ffn_up.weight, src1_name: ffn_norm-0
computation on gpu 302 us, src0_name: blk.0.ffn_down_t.weight, src1_name: ffn_gate_par-0

At this point, the problem has finally been narrowed down to a specific operator: the sparse AXPY operator implemented by the PowerInfer project itself.

static void dequantize_axpy_vec_q4_0_cuda(const void * vx, const dfloat * y, float * dst, const int ncols, const int nrows, cudaStream_t stream) {
    GGML_ASSERT(ncols % GGML_CUDA_DMMV_X == 0);
    const int block_num_y = (nrows + GGML_CUDA_MMV_Y - 1) / GGML_CUDA_MMV_Y;
    const dim3 block_nums(1, block_num_y, 1);
    const dim3 block_dims(WARP_SIZE, GGML_CUDA_MMV_Y, 1);
    dequantize_mul_mat_axpy<QK4_0, QR4_0, dequantize_q4_0>
        <<<block_nums, block_dims, ncols*sizeof(float), stream>>>(vx, y, dst, ncols, nrows);
}
static void dequantize_axpy_sparse_vec_q4_0_cuda(const void * vx, const dfloat * y, float * dst, const int ncols, const int nrows, cudaStream_t stream, int *lst, float *idx)  {
    GGML_ASSERT(ncols % GGML_CUDA_DMMV_X == 0);
    const int block_num_y = (nrows + GGML_CUDA_MMV_Y - 1) / GGML_CUDA_MMV_Y;
    const dim3 block_nums(1, block_num_y, 1);
    const dim3 block_dims(WARP_SIZE, GGML_CUDA_MMV_Y, 1);
    dequantize_mul_mat_axpy_sparse<QK4_0, QR4_0, dequantize_q4_0>
        <<<block_nums, block_dims, ncols*sizeof(float), stream>>>(vx, y, dst, ncols, nrows, lst, idx);
}

The kernel function is:

template <int qk, int qr, dequantize_kernel_t dequantize_kernel>
static __global__ void dequantize_mul_mat_axpy_sparse(const void * __restrict__ vx, const dfloat * __restrict__ y, float * __restrict__ dst, const int ncols, const int nrows, int *lst, float *idx) {
    const int gpu_row = blockIdx.y*blockDim.y + threadIdx.y;

    if (gpu_row >= nrows) {
        return;
    }
    int row = lst ? lst[gpu_row] : gpu_row;
    const int tid = threadIdx.x;
    short *d = (short *)((char *)vx + ncols * gpu_row * 2);

    if (y[row] == 0)
        return;
    if (idx[row] < dev_sparse_threshold) {
        return;
    }

    const int bid = blockIdx.y;

    extern __shared__ float shared_dst[];

    const int iter_stride = 2*GGML_CUDA_DMMV_X;
    const int vals_per_iter = iter_stride / WARP_SIZE;
    const int y_offset = qr == 1 ? 1 : qk/2;

// partial sum for each thread
    float tmp = 0.0f;
    for (int i = 0; i < ncols; i += GGML_CUDA_DMMV_X) {
        shared_dst[i+tid] = 0;
    }
    __syncthreads();

    for (int i = 0; i < ncols; i += iter_stride) {
        const int col = i + vals_per_iter*tid;
        const int ib = (gpu_row*ncols + col)/qk; // x block index
        const int iqs = (col%qk)/qr; // x quant index
        const int iybs = col - col%qk; // y block start index

// processing >2 values per i iter is faster for fast GPUs
#pragma unroll
        for (int j = 0; j < vals_per_iter; j += 2) {
            dfloat2 v;
            dequantize_kernel(vx, ib, iqs + j/qr, v);

            tmp = v.x * y[row];
            shared_dst[iybs + iqs + j/qr + 0] = tmp;
            tmp = v.y * y[row];
            shared_dst[iybs + iqs + j/qr + y_offset] = tmp;
            
        }
    }
    __syncthreads();

    for (int i = 0; i < ncols; i += GGML_CUDA_DMMV_X) {
        atomicAdd(&dst[i+tid], shared_dst[i+tid]);
        // dst[i+tid] = shared_dst[i+tid]
    }
}

In this operator, the final atomicAdd looked very suspicious, so I replaced it with dst[i+tid] += shared_dst[i+tid];. This obviously produces wrong results, but it isolates the performance impact. As expected, after the replacement, throughput increased from 40 tokens/s to 85 tokens/s, so this atomicAdd is clearly the bottleneck.
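
Concretely, the experiment changes only the final write-back loop (wrong results, used purely to measure the cost of the atomics):

    for (int i = 0; i < ncols; i += GGML_CUDA_DMMV_X) {
        // experimental: plain read-modify-write instead of atomicAdd; incorrect,
        // but isolates the cost of the contended atomics
        dst[i+tid] += shared_dst[i+tid];
    }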

As shown in the above figure, there is no raw throughput difference between NVIDIA and AMD for atomicAdd itself, so we speculate that the root cause of the performance problem is contention: when many threads attempt to update the same position of the dst array simultaneously, the atomics serialize and effective concurrency drops almost linearly.
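
To make the contention hypothesis concrete, here is a minimal standalone microbenchmark (hypothetical, not part of PowerInfer; written against the CUDA runtime API and portable to ROCm via hipify) that compares atomicAdd on a single shared address against atomicAdd on distinct addresses:

#include <cstdio>
#include <cuda_runtime.h>

// Every thread hammers the same address: the atomics serialize.
__global__ void atomic_contended(float * dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&dst[0], 1.0f);
}

// Every thread updates its own element: no contention.
__global__ void atomic_spread(float * dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&dst[i], 1.0f);
}

int main() {
    const int n = 1 << 20;
    float * dst = nullptr;
    cudaMalloc(&dst, n * sizeof(float));
    cudaMemset(dst, 0, n * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    float ms_contended = 0.0f, ms_spread = 0.0f;

    cudaEventRecord(t0);
    atomic_contended<<<(n + 255) / 256, 256>>>(dst, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms_contended, t0, t1);

    cudaEventRecord(t0);
    atomic_spread<<<(n + 255) / 256, 256>>>(dst, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms_spread, t0, t1);

    printf("contended atomicAdd: %.3f ms, spread atomicAdd: %.3f ms\n", ms_contended, ms_spread);
    cudaFree(dst);
    return 0;
}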

Algorithm optimization:

Before:

template <int qk, int qr, dequantize_kernel_t dequantize_kernel>
static __global__ void dequantize_mul_mat_axpy_sparse(const void * __restrict__ vx, const dfloat * __restrict__ y, float * __restrict__ dst, const int ncols, const int nrows, int *lst, float *idx) {
    // qk = quantized weights per x block
    // qr = number of quantized weights per data value in x block
    const int gpu_row = blockIdx.y*blockDim.y + threadIdx.y;
 
    if (gpu_row >= nrows) {
        return;
    }
    int row = lst ? lst[gpu_row] : gpu_row;
    const int tid = threadIdx.x;
    short *d = (short *)((char *)vx + ncols * gpu_row * 2);
 
    if (y[row] == 0)
        return;
    if (idx[row] < dev_sparse_threshold) {
        return;
    }
 
    const int bid = blockIdx.y;
 
    extern __shared__ float shared_dst[]; // TODO:dynamic
 
    const int iter_stride = 2*GGML_CUDA_DMMV_X;
    const int vals_per_iter = iter_stride / WARP_SIZE; // num quantized vals per thread and i iter
    const int y_offset = qr == 1 ? 1 : qk/2;
 
// partial sum for each thread
    float tmp = 0.0f;
    for (int i = 0; i < ncols; i += GGML_CUDA_DMMV_X) {
        shared_dst[i+tid] = 0;
    }
    __syncthreads();
 
    for (int i = 0; i < ncols; i += iter_stride) {
        const int col = i + vals_per_iter*tid;
        const int ib = (gpu_row*ncols + col)/qk; // x block index
        const int iqs = (col%qk)/qr; // x quant index
        const int iybs = col - col%qk; // y block start index
 
// processing >2 values per i iter is faster for fast GPUs
#pragma unroll
        for (int j = 0; j < vals_per_iter; j += 2) {
            // process 2 vals per j iter
 
            // dequantize
            // for qr = 2 the iqs needs to increase by 1 per j iter because 2 weights per data val
            dfloat2 v;
            dequantize_kernel(vx, ib, iqs + j/qr, v);
 
            // matrix multiplication
            // for qr = 2 the y index needs to increase by 1 per j iter because of y_offset = qk/2
            tmp = v.x * y[row];
            shared_dst[iybs + iqs + j/qr + 0] = tmp;
            tmp = v.y * y[row];
            shared_dst[iybs + iqs + j/qr + y_offset] = tmp;
           
        }
    }
    __syncthreads();
 
    for (int i = 0; i < ncols; i += GGML_CUDA_DMMV_X) {
        atomicAdd(&dst[i+tid], shared_dst[i+tid]);
    }
}
static void dequantize_axpy_sparse_batch_q4_0_cuda(const void * vx, const dfloat * y, float * dst, const int ncols, const int nrows, int src1_rows, int src1_ncols, cudaStream_t stream, int *lst, float *idx) {
    GGML_ASSERT(ncols % GGML_CUDA_DMMV_X == 0);
    const int block_num_y = (nrows + GGML_CUDA_MMV_Y - 1) / GGML_CUDA_MMV_Y;
    const dim3 block_nums(1, block_num_y, 1);
    const dim3 block_dims(WARP_SIZE, GGML_CUDA_MMV_Y, 1);
    dequantize_mul_mat_axpy_sparse_batch<QK4_0, QR4_0, dequantize_q4_0>
        <<<block_nums, block_dims, ncols*sizeof(float), stream>>>(vx, y, dst, ncols, nrows, src1_rows, src1_ncols, lst, idx);
}

After:

template <int qk, int qr, dequantize_kernel_t dequantize_kernel>
static __global__ void my_1col_new_dequantize_mul_mat_axpy_sparse_batch(const void * __restrict__ vx, const dfloat * __restrict__ y, float * __restrict__ dst, const int ncols, const int nrows, int *lst, float *idx) {
    // printf("in 1col kernel\n");
    int warp_id = threadIdx.y;
    int tid = threadIdx.x + blockIdx.x * 32;
    int col = tid * 2;
    dfloat2 v;
    int iqs = (col % qk) / qr;
    float tmp[2];
    tmp[0] = 0.0;
    tmp[1] = 0.0;
    __shared__ float res[64];
    res[threadIdx.x] = 0.0;
    res[threadIdx.x + 32] = 0.0;
 
#pragma unroll 32
    for (int row = warp_id; row < nrows; row += 32) {
        int raw_row = row;
        // int raw_row = row;
        dfloat y_row = y[raw_row];
        if (y_row == 0.0) {
            continue;
        }
        const int ib = (row * ncols + col) / qk;
        dequantize_kernel(vx, ib, iqs, v);
        tmp[0] += v.x * y_row;
        tmp[1] += v.y * y_row;
    }
    const int adder_loc = threadIdx.x % 16 + threadIdx.x / 16 * 32;
    atomicAdd(res + adder_loc, tmp[0]);
    atomicAdd(res + adder_loc + 16, tmp[1]);
    __syncthreads();
    if (warp_id <= 1) {
        int write_back_loc = warp_id * 32 + threadIdx.x;
        dst[write_back_loc + blockIdx.x * 64] = res[write_back_loc];
    }
}
static void dequantize_axpy_sparse_vec_q4_0_cuda(const void * vx, const dfloat * y, float * dst, const int ncols, const int nrows, cudaStream_t stream, int *lst, float *idx)  {
    GGML_ASSERT(ncols % GGML_CUDA_DMMV_X == 0);
 
    const dim3 block_dim = dim3(32, 32);
    const int block_num = ncols / 64;
 
    my_1col_new_dequantize_mul_mat_axpy_sparse_batch<QK4_0, QR4_0, dequantize_q4_0>
        <<<block_num, block_dim, 0, stream>>>(vx, y, dst, ncols, nrows, lst, idx);
 
}
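
To summarize the design change: each block now owns 64 consecutive dst elements, its 32 warps stride over the rows, and partial sums only ever meet in block-local shared memory, so the heavily contended global atomicAdd disappears. A standalone sketch of the implied launch geometry (ncols = 4096 as for the ffn_down output; not project code):

#include <cstdio>

// Sanity check of the geometry implied by dequantize_axpy_sparse_vec_q4_0_cuda above.
int main() {
    const int ncols = 4096;                 // dst length for the ffn_down AXPY
    const int block_num = ncols / 64;       // one block per 64 output elements
    const int threads_per_block = 32 * 32;  // dim3(32, 32): 32 lanes x 32 row-striding warps
    printf("grid: %d blocks of %d threads\n", block_num, threads_per_block);
    printf("block 3 writes dst[%d..%d]\n", 3 * 64, 3 * 64 + 63);
    return 0;
}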

7. Final Optimization Result

PowerInfer_output

8. Future Work

We avoided atomicAdd as much as possible through algorithmic changes, but the fact remains that AMD devices perform extremely poorly under highly contended atomicAdd, while NVIDIA handles it well. This may be because AMD's atomicAdd implementation is less efficient, or because the AMD compiler toolchain does not handle such patterns as intelligently. Pinning down the exact reason is worthwhile follow-up work, and doing so could significantly improve the efficiency of AMD machines in general scenarios.