Environment:

aupxtx@aupxtx:~$ python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 2.3.0+rocm5.7
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 5.7.31921-d1770ee1b

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.0.2 24012 af27734ed982b52a9f1be0f035ac91726fc697e4)
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: Radeon RX 7900 XTX (gfx1100)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 5.7.31921
MIOpen runtime version: 2.20.0
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   48 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          16
On-line CPU(s) list:             0-15
Vendor ID:                       AuthenticAMD
Model name:                      AMD Ryzen 7 7800X3D 8-Core Processor
CPU family:                      25
Model:                           97
Thread(s) per core:              2
Core(s) per socket:              8
Socket(s):                       1
Stepping:                        2
Frequency boost:                 enabled
CPU max MHz:                     5049.0229
CPU min MHz:                     3000.0000
BogoMIPS:                        8399.69
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                  AMD-V
L1d cache:                       256 KiB (8 instances)
L1i cache:                       256 KiB (8 instances)
L2 cache:                        8 MiB (8 instances)
L3 cache:                        96 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-15
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] pytorch-triton-rocm==2.3.0
[pip3] torch==2.3.0+rocm5.7
[pip3] torchaudio==2.3.0+rocm5.7
[pip3] torchvision==0.18.0+rocm5.7

Debug Log

1. Failed to generate GPU split

.............................................
invoking powerinfer Python module to generate gpu split for 59036.48 MiB of VRAM
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy-py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/xxx/.local/lib/python3.8/site-packages/powerinfer/__main__.py", line 5, in <module>
    from .export_split import export_split
  File "/home/xxx/.local/lib/python3.8/site-packages/powerinfer/export_split.py", line 50, in <module>
    def export_split(activations_path: str, output_path: str, solved_list: list[int], vram_capacity: int):
TypeError: 'type' object is not subscriptable
llm_load_gpu_split_with_budget: error: failed to generate gpu split
llm_load_gpu_split: error: failed to generate gpu split, an empty one will be used
offload_ffn_split: applying augmentation to model - please wait ...
................................ done (6.43 ms)
llm_load_gpu_split: offloaded 0.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size  = 256.00 MB
llama_build_graph: non-view tensors processed: 548/836

The initial Python version was 3.8.x, which does not satisfy the prerequisites: the list[int] annotation in export_split.py uses builtin generics (PEP 585), which only became subscriptable at runtime in Python 3.9, hence the TypeError above.

Solution: Upgrade Python to 3.9 or later.
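
For context, a minimal sketch of why the signature fails on 3.8 (parameter names copied from the traceback above; the body is a placeholder):

# On Python 3.8, evaluating the annotation list[int] at function-definition
# time raises TypeError: 'type' object is not subscriptable; builtin
# generics only became valid at runtime in 3.9 (PEP 585).
from typing import List

# 3.9+ only:
# def export_split(activations_path: str, output_path: str,
#                  solved_list: list[int], vram_capacity: int): ...

# Equivalent and valid on 3.8 as well (typing.List); adding
# "from __future__ import annotations" at the top of the module is
# another way to defer annotation evaluation.
def export_split(activations_path: str, output_path: str,
                 solved_list: List[int], vram_capacity: int) -> None:
    pass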

2. Segmentation fault (core dumped)

llama_model_loader: - tensor   58:                   blk.29.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   59:                blk.29.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   60:                   blk.30.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   61:                blk.30.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   62:                   blk.31.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   63:                blk.31.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:              generic.gpu_index.block_count u32
llama_model_loader: - kv   2:                        split.vram_capacity u64
llama_model_loader: - type  i32:   64 tensors
loaded gpu_idx, vram_required: 18367365120
load_gpu_idx_for_model: applying gpu_idx adapter from './ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (0.75 ms)
offload_ffn_split: applying augmentation to model - please wait ...
Segmentation fault (core dumped)

Now this bug has been fixed! Please refer to https://github.com/SJTU-IPADS/PowerInfer/pull/139

Attempted workaround: change cudaMemcpyToSymbol(dev_sparse_threshold, &sparse_pred_threshold, sizeof(float)) to cudaMemcpyToSymbol(&dev_sparse_threshold, &sparse_pred_threshold, sizeof(float)), but this only leads to bug 3 below.
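
That workaround is expected to fail: cudaMemcpyToSymbol takes the device symbol itself, not its host-side address. A minimal sketch of the distinction (assuming a __device__ float named dev_sparse_threshold, as the call site in ggml-cuda.cu suggests; on ROCm builds the call is presumably mapped to the HIP equivalent):

#include <cuda_runtime.h>

__device__ float dev_sparse_threshold;

void set_sparse_threshold(float sparse_pred_threshold) {
    // Correct: pass the symbol itself; the runtime resolves it in the
    // symbol table registered for the loaded GPU code object.
    cudaMemcpyToSymbol(dev_sparse_threshold, &sparse_pred_threshold, sizeof(float));

    // Wrong: &dev_sparse_threshold is an ordinary host address, not a
    // registered symbol, so the runtime reports "invalid device symbol" --
    // exactly the CUDA error 13 seen in bug 3 below.
    // cudaMemcpyToSymbol(&dev_sparse_threshold, &sparse_pred_threshold, sizeof(float));
}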

3. CUDA error 13: invalid device symbol

llama_model_loader: - tensor   60:                   blk.30.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   61:                blk.30.gpu_bucket i32      [  1792,     1,     1,     1 ]
llama_model_loader: - tensor   62:                   blk.31.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   63:                blk.31.gpu_bucket i32      [  2048,     1,     1,     1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:              generic.gpu_index.block_count u32
llama_model_loader: - kv   2:                        split.vram_capacity u64
llama_model_loader: - type  i32:   64 tensors
loaded gpu_idx, vram_required: 2093465600
load_gpu_idx_for_model: applying gpu_idx adapter from './ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (1.91 ms)
offload_ffn_split: applying augmentation to model - please wait ...

CUDA error 13 at /var/lib/jenkins/PowerInfer/ggml-cuda.cu:9440: invalid device symbol
current device: 0

Now this bug has been fixed! Please refer to https://github.com/SJTU-IPADS/PowerInfer/pull/139

4. CUDA error 303: shared object initialization failed

llama_model_loader: - tensor   60:                   blk.30.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   61:                blk.30.gpu_bucket i32      [  1792,     1,     1,     1 ]
llama_model_loader: - tensor   62:                   blk.31.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   63:                blk.31.gpu_bucket i32      [  2048,     1,     1,     1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:              generic.gpu_index.block_count u32
llama_model_loader: - kv   2:                        split.vram_capacity u64
llama_model_loader: - type  i32:   64 tensors
loaded gpu_idx, vram_required: 2093465600
load_gpu_idx_for_model: applying gpu_idx adapter from './ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (1.81 ms)
offload_ffn_split: applying augmentation to model - please wait ...
................................ done (1764.28 ms)
llm_load_gpu_split: offloaded 1980.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_build_graph: non-view tensors processed: 548/1028
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 36.25 MB
llama_new_context_with_model: VRAM scratch buffer: 34.69 MB
llama_new_context_with_model: total VRAM used: 8210.20 MB (model: 5939.52 MB, context: 290.69 MB)

CUDA error 303 at /var/lib/jenkins/PowerInfer/ggml-cuda.cu:7877: shared object initialization failed
current device: 0

None of the kernel functions could be launched; every launch failed with CUDA error 303.

Solution: Add an extra compilation option: -DAMDGPU_TARGETS=gfx1100 (replace gfx1100 with your card's architecture, which you can look up with rocminfo).
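
For reference, a quick way to look up the target on a standard ROCm install (the exact output format may vary between ROCm versions):

# Prints the gfx target of the first GPU agent, e.g. gfx1100
rocminfo | grep -m 1 -o "gfx[0-9a-f]*"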

What I can confirm is that the crash happens right after the program finishes executing the llama.cpp function:

struct llama_context * llama_new_context_with_model(struct llama_model * model, struct llama_context_params params) {

According to the log of a correct run, the program should next enter the llama.cpp function:

const char * llama_print_system_info(void) {

Between these two, according to the log, execution jumps into ggml-cuda.cu:

static void ggml_cuda_op_flatten(const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, const ggml_cuda_op_flatten_t op) {

The failure occurs inside that function, at the call:

op(src0, src1, dst, src0_ddf, src1_ddf, dst_ddf, main_stream);

This call errors out at run time, which ultimately surfaces as CUDA error 303.

5. Segmentation fault (core dumped)

llama_model_loader: - tensor   60:                   blk.30.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   61:                blk.30.gpu_bucket i32      [  1792,     1,     1,     1 ]
llama_model_loader: - tensor   62:                   blk.31.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   63:                blk.31.gpu_bucket i32      [  2048,     1,     1,     1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:              generic.gpu_index.block_count u32
llama_model_loader: - kv   2:                        split.vram_capacity u64
llama_model_loader: - type  i32:   64 tensors
loaded gpu_idx, vram_required: 2093465600
load_gpu_idx_for_model: applying gpu_idx adapter from './ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (1.76 ms)
offload_ffn_split: applying augmentation to model - please wait ...
................................ done (1744.70 ms)
llm_load_gpu_split: offloaded 1980.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_build_graph: non-view tensors processed: 548/1028
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 36.25 MB
llama_new_context_with_model: VRAM scratch buffer: 34.69 MB
llama_new_context_with_model: total VRAM used: 8210.20 MB (model: 5939.52 MB, context: 290.69 MB)
Segmentation fault (core dumped)

I added some debug markers to the code and can confirm that the program finishes llama_new_context_with_model but never reaches const char * llama_print_system_info(void) (the "111"/"222" and "Operation:" lines below are my markers). Apart from that, all CUDA functions execute correctly.

llama_new_context_with_model: compute buffer total size = 36.25 MB
llama_new_context_with_model: VRAM scratch buffer: 34.69 MB
llama_new_context_with_model: total VRAM used: 8210.20 MB (model: 5939.52 MB, context: 290.69 MB)
111
222
Operation: ggml_cuda_op_rms_norm
Operation: ggml_cuda_op_mul
Operation: ggml_cuda_op_rope
Operation: ggml_cuda_op_rope
Operation: ggml_cuda_op_scale
Operation: ggml_cuda_op_add
add_finish
Operation: ggml_cuda_op_soft_max
Operation: ggml_cuda_op_add
add_finish
Operation: ggml_cuda_op_rms_norm
Operation: ggml_cuda_op_mul
Operation: ggml_cuda_op_relu
Operation: ggml_cuda_op_add
add_finish
Segmentation fault (core dumped)

Solution: Add an extra command-line parameter: --reset-gpu-index (to avoid any stale GPU-index cache).

6. Success

# How to run it correctly
rm -rf build
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release -j 24
./build/bin/main -m ./ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf -n 128 -p "Once upon a time" --ignore-eos --seed 0 --top-k 1 --reset-gpu-index

A correct run produces output like the following:

root@5de7c34ac60d:/var/lib/jenkins/PowerInfer# ./build/bin/main -m ./ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf -n 128 -t 1 -p "Once upon a time" --ignore-eos --seed 0 --top-k 1 --reset-gpu-index
Log start
main: build = 1572 (47e9d7e)
main: built with AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.0.0 23483 7208e8d15fbf218deb74483ea8c549c67ca4985e) for x86_64-unknown-linux-gnu
main: seed  = 0
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0
llama_model_loader: loaded meta data with 18 key-value pairs and 355 tensors from ./ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    7:          blk.0.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    8:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    9:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   10:              blk.1.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   11:              blk.1.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   12:              blk.1.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   13:         blk.1.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   14:            blk.1.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   15:              blk.1.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   16:          blk.1.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   17:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   19:              blk.2.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   20:              blk.2.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   21:              blk.2.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   22:         blk.2.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   23:            blk.2.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   24:              blk.2.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   25:          blk.2.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   26:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   27:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   28:              blk.3.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   29:              blk.3.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   30:              blk.3.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   31:         blk.3.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   32:            blk.3.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   33:              blk.3.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   34:          blk.3.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   35:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   36:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   37:              blk.4.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   38:              blk.4.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   39:              blk.4.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   40:         blk.4.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   41:            blk.4.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   42:              blk.4.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   43:          blk.4.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   44:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   45:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   46:              blk.5.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   47:              blk.5.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   48:              blk.5.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   49:         blk.5.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   50:            blk.5.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   51:              blk.5.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   52:          blk.5.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   53:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   54:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   55:              blk.6.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   56:              blk.6.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   57:              blk.6.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   58:         blk.6.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   59:            blk.6.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   60:              blk.6.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   61:          blk.6.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   62:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   63:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   64:              blk.7.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   65:              blk.7.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   66:              blk.7.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   67:         blk.7.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   68:            blk.7.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   69:              blk.7.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   70:          blk.7.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   71:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   72:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   73:              blk.8.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   74:              blk.8.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   75:              blk.8.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   76:         blk.8.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   77:            blk.8.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   78:              blk.8.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   79:          blk.8.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   80:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   81:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   82:              blk.9.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   83:              blk.9.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   84:              blk.9.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   85:         blk.9.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   86:            blk.9.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   87:              blk.9.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   88:          blk.9.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   89:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   90:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   91:             blk.10.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   92:             blk.10.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   93:             blk.10.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   94:        blk.10.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   95:           blk.10.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   96:             blk.10.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   97:         blk.10.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   98:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   99:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  100:             blk.11.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  101:             blk.11.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  102:             blk.11.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  103:        blk.11.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  104:           blk.11.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  105:             blk.11.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  106:         blk.11.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  107:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  108:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  109:             blk.12.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  110:             blk.12.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  111:             blk.12.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  112:        blk.12.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  113:           blk.12.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  114:             blk.12.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  115:         blk.12.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  116:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  117:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  118:             blk.13.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  119:             blk.13.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  120:             blk.13.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  121:        blk.13.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  122:           blk.13.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  123:             blk.13.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  124:         blk.13.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  125:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  126:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  127:             blk.14.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  128:             blk.14.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  129:             blk.14.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  130:        blk.14.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  131:           blk.14.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  132:             blk.14.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  133:         blk.14.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  134:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  135:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  136:             blk.15.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  137:             blk.15.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  138:             blk.15.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  139:        blk.15.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  140:           blk.15.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  141:             blk.15.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  142:         blk.15.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  143:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  144:           blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  145:             blk.16.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  146:             blk.16.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  147:             blk.16.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  148:        blk.16.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  149:           blk.16.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  150:             blk.16.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  151:         blk.16.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  152:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  153:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  154:             blk.17.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  155:             blk.17.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  156:             blk.17.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  157:        blk.17.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  158:           blk.17.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  159:             blk.17.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  160:         blk.17.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  161:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  162:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  163:             blk.18.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  164:             blk.18.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  165:             blk.18.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  166:        blk.18.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  167:           blk.18.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  168:             blk.18.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  169:         blk.18.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  170:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  171:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  172:             blk.19.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  173:             blk.19.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  174:             blk.19.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  175:        blk.19.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  176:           blk.19.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  177:             blk.19.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  178:         blk.19.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  179:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  180:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  181:             blk.20.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  182:             blk.20.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  183:             blk.20.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  184:        blk.20.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  185:           blk.20.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  186:             blk.20.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  187:         blk.20.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  188:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  189:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  190:             blk.21.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  191:             blk.21.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  192:             blk.21.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  193:        blk.21.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  194:           blk.21.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  195:             blk.21.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  196:         blk.21.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  197:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  198:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  199:             blk.22.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  200:             blk.22.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  201:             blk.22.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  202:        blk.22.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  203:           blk.22.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  204:             blk.22.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  205:         blk.22.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  206:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  207:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  208:             blk.23.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  209:             blk.23.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  210:             blk.23.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  211:        blk.23.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  212:           blk.23.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  213:             blk.23.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  214:         blk.23.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  215:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  216:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  217:             blk.24.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  218:             blk.24.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  219:             blk.24.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  220:        blk.24.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  221:           blk.24.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  222:             blk.24.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  223:         blk.24.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  224:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  225:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  226:             blk.25.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  227:             blk.25.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  228:             blk.25.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  229:        blk.25.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  230:           blk.25.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  231:             blk.25.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  232:         blk.25.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  233:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  234:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  235:             blk.26.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  236:             blk.26.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  237:             blk.26.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  238:        blk.26.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  239:           blk.26.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  240:             blk.26.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  241:         blk.26.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  242:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  243:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  244:             blk.27.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  245:             blk.27.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  246:             blk.27.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  247:        blk.27.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  248:           blk.27.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  249:             blk.27.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  250:         blk.27.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  251:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  252:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  253:             blk.28.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  254:             blk.28.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  255:             blk.28.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  256:        blk.28.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  257:           blk.28.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  258:             blk.28.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  259:         blk.28.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  260:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  261:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  262:             blk.29.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  263:             blk.29.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  264:             blk.29.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  265:        blk.29.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  266:           blk.29.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  267:             blk.29.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  268:         blk.29.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  269:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  270:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  271:             blk.30.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  272:             blk.30.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  273:             blk.30.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  274:        blk.30.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  275:           blk.30.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  276:             blk.30.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  277:         blk.30.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  278:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  279:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  280:             blk.31.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  281:             blk.31.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  282:             blk.31.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  283:        blk.31.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  284:           blk.31.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  285:             blk.31.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  286:         blk.31.ffn_down_t.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  287:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  288:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  289:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  290:                    output.weight f16      [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor  291:                 blk.0.fc1.weight f16      [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  292:                 blk.0.fc2.weight f16      [  1024, 11008,     1,     1 ]
llama_model_loader: - tensor  293:                 blk.1.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  294:                 blk.1.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  295:                 blk.2.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  296:                 blk.2.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  297:                 blk.3.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  298:                 blk.3.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  299:                 blk.4.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  300:                 blk.4.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  301:                 blk.5.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  302:                 blk.5.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  303:                 blk.6.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  304:                 blk.6.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  305:                 blk.7.fc1.weight f16      [  4096,  1536,     1,     1 ]
llama_model_loader: - tensor  306:                 blk.7.fc2.weight f16      [  1536, 11008,     1,     1 ]
llama_model_loader: - tensor  307:                 blk.8.fc1.weight f16      [  4096,  1536,     1,     1 ]
llama_model_loader: - tensor  308:                 blk.8.fc2.weight f16      [  1536, 11008,     1,     1 ]
llama_model_loader: - tensor  309:                 blk.9.fc1.weight f16      [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  310:                 blk.9.fc2.weight f16      [  1024, 11008,     1,     1 ]
llama_model_loader: - tensor  311:                blk.10.fc1.weight f16      [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  312:                blk.10.fc2.weight f16      [  1024, 11008,     1,     1 ]
llama_model_loader: - tensor  313:                blk.11.fc1.weight f16      [  4096,  1024,     1,     1 ]
llama_model_loader: - tensor  314:                blk.11.fc2.weight f16      [  1024, 11008,     1,     1 ]
llama_model_loader: - tensor  315:                blk.12.fc1.weight f16      [  4096,  1280,     1,     1 ]
llama_model_loader: - tensor  316:                blk.12.fc2.weight f16      [  1280, 11008,     1,     1 ]
llama_model_loader: - tensor  317:                blk.13.fc1.weight f16      [  4096,  1280,     1,     1 ]
llama_model_loader: - tensor  318:                blk.13.fc2.weight f16      [  1280, 11008,     1,     1 ]
llama_model_loader: - tensor  319:                blk.14.fc1.weight f16      [  4096,  1536,     1,     1 ]
llama_model_loader: - tensor  320:                blk.14.fc2.weight f16      [  1536, 11008,     1,     1 ]
llama_model_loader: - tensor  321:                blk.15.fc1.weight f16      [  4096,  1536,     1,     1 ]
llama_model_loader: - tensor  322:                blk.15.fc2.weight f16      [  1536, 11008,     1,     1 ]
llama_model_loader: - tensor  323:                blk.16.fc1.weight f16      [  4096,  1536,     1,     1 ]
llama_model_loader: - tensor  324:                blk.16.fc2.weight f16      [  1536, 11008,     1,     1 ]
llama_model_loader: - tensor  325:                blk.17.fc1.weight f16      [  4096,  1536,     1,     1 ]
llama_model_loader: - tensor  326:                blk.17.fc2.weight f16      [  1536, 11008,     1,     1 ]
llama_model_loader: - tensor  327:                blk.18.fc1.weight f16      [  4096,  1536,     1,     1 ]
llama_model_loader: - tensor  328:                blk.18.fc2.weight f16      [  1536, 11008,     1,     1 ]
llama_model_loader: - tensor  329:                blk.19.fc1.weight f16      [  4096,  1792,     1,     1 ]
llama_model_loader: - tensor  330:                blk.19.fc2.weight f16      [  1792, 11008,     1,     1 ]
llama_model_loader: - tensor  331:                blk.20.fc1.weight f16      [  4096,  1792,     1,     1 ]
llama_model_loader: - tensor  332:                blk.20.fc2.weight f16      [  1792, 11008,     1,     1 ]
llama_model_loader: - tensor  333:                blk.21.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  334:                blk.21.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  335:                blk.22.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  336:                blk.22.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  337:                blk.23.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  338:                blk.23.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  339:                blk.24.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  340:                blk.24.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  341:                blk.25.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  342:                blk.25.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  343:                blk.26.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  344:                blk.26.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  345:                blk.27.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  346:                blk.27.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  347:                blk.28.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  348:                blk.28.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  349:                blk.29.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  350:                blk.29.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  351:                blk.30.fc1.weight f16      [  4096,  2048,     1,     1 ]
llama_model_loader: - tensor  352:                blk.30.fc2.weight f16      [  2048, 11008,     1,     1 ]
llama_model_loader: - tensor  353:                blk.31.fc1.weight f16      [  4096,  1536,     1,     1 ]
llama_model_loader: - tensor  354:                blk.31.fc2.weight f16      [  1536, 11008,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                          general.file_type u32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  290 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly F16
llm_load_print_meta: model params     = 7.57 B
llm_load_print_meta: model size       = 14.11 GiB (16.00 BPW)
llm_load_print_meta: general.name     = syx
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
llm_load_sparse_model_tensors: ggml ctx size =    0.13 MB
llm_load_sparse_model_tensors: using ROCm for GPU acceleration
llm_load_sparse_model_tensors: offloaded layers from VRAM budget(24853348352 bytes): 33/32
llm_load_sparse_model_tensors: mem required  = 14446.15 MB
llm_load_sparse_model_tensors: VRAM used: 5939.52 MB
....................................................................................................
invoking powerinfer Python module to generate gpu split for 17506.48 MiB of VRAM
llama_model_loader: loaded meta data with 3 key-value pairs and 64 tensors from ./ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                    blk.0.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    1:                 blk.0.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    blk.1.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    3:                 blk.1.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    4:                    blk.2.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    5:                 blk.2.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    6:                    blk.3.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    7:                 blk.3.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    8:                    blk.4.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    9:                 blk.4.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   10:                    blk.5.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   11:                 blk.5.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   12:                    blk.6.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   13:                 blk.6.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   14:                    blk.7.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   15:                 blk.7.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   16:                    blk.8.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   17:                 blk.8.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   18:                    blk.9.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   19:                 blk.9.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   20:                   blk.10.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   21:                blk.10.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   22:                   blk.11.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   23:                blk.11.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   24:                   blk.12.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   25:                blk.12.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   26:                   blk.13.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   27:                blk.13.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   28:                   blk.14.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   29:                blk.14.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   30:                   blk.15.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   31:                blk.15.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   32:                   blk.16.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   33:                blk.16.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   34:                   blk.17.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   35:                blk.17.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   36:                   blk.18.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   37:                blk.18.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   38:                   blk.19.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   39:                blk.19.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   40:                   blk.20.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   41:                blk.20.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   42:                   blk.21.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   43:                blk.21.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   44:                   blk.22.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   45:                blk.22.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   46:                   blk.23.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   47:                blk.23.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   48:                   blk.24.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   49:                blk.24.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   50:                   blk.25.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   51:                blk.25.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   52:                   blk.26.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   53:                blk.26.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   54:                   blk.27.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   55:                blk.27.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   56:                   blk.28.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   57:                blk.28.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   58:                   blk.29.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   59:                blk.29.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   60:                   blk.30.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   61:                blk.30.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   62:                   blk.31.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   63:                blk.31.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:              generic.gpu_index.block_count u32
llama_model_loader: - kv   2:                        split.vram_capacity u64
llama_model_loader: - type  i32:   64 tensors
load_gpu_idx_for_model: applying gpu_idx adapter from './ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (0.75 ms)
offload_ffn_split: applying augmentation to model - please wait ...
................................ done (659.27 ms)
llm_load_gpu_split: offloaded 8256.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_build_graph: non-view tensors processed: 580/836
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 6.91 MB
llama_new_context_with_model: VRAM scratch buffer: 5.34 MB
llama_new_context_with_model: total VRAM used: 22712.86 MB (model: 14195.52 MB, context: 261.34 MB)
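
The VRAM accounting above is worth a second look. The two model-load figures are exactly additive: the 5939.52 MB offloaded at load time plus the 8256.00 MiB FFN split equals the 14195.52 MB reported as "model" in the final total. The grand total, however, adds the FFN split once more (14195.52 + 8256.00 + 261.34 = 22712.86), so the "total VRAM used" line appears to double-count the split. A quick check of the arithmetic (all figures copied from the log; the MB/MiB mix is the log's own):

```python
# All numbers below are copied verbatim from the loader output above.
offload_at_load = 5939.52   # MB  "VRAM used" after llm_load_sparse_model_tensors
ffn_split       = 8256.00   # MiB "offloaded ... of FFN weights to GPU"
kv_cache        = 256.00    # MB  "VRAM kv self"
scratch         = 5.34      # MB  "VRAM scratch buffer"

model   = offload_at_load + ffn_split   # 14195.52 -> the "model:" figure
context = kv_cache + scratch            # 261.34   -> the "context:" figure
total   = model + ffn_split + context   # 22712.86 -> "total VRAM used",
                                        # i.e. the split seems counted twice
print(round(model, 2), round(context, 2), round(total, 2))
```

If the 22.7 GB figure were accurate, this 24 GB card would be nearly full; the additive identities above suggest real usage closer to 14.5 GB, which rocm-smi should be able to confirm at runtime.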

system_info: n_threads = 1 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 1, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 32, n_predict = 128, n_keep = 0
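
One detail of the sampling block is worth noting: with top_k = 1 the decode is effectively greedy, since only the single most-probable token survives the top-k filter, so temp = 0.800, top_p = 0.950, and min_p = 0.050 never get a chance to matter. A minimal numpy sketch of top-k sampling (an illustration, not PowerInfer's implementation) showing the collapse:

```python
import numpy as np

def sample_top_k(logits, k, temp, rng=np.random.default_rng(0)):
    idx = np.argsort(logits)[-k:]    # indices of the k largest logits
    z = logits[idx] / temp           # temperature only rescales the survivors
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(idx[rng.choice(len(idx), p=p)])

logits = np.array([1.0, 3.5, 0.2, 2.9])
# With k = 1 the distribution has a single atom: sampling == argmax,
# regardless of temperature.
assert sample_top_k(logits, k=1, temp=0.8) == int(np.argmax(logits))
```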

llama_print_timings:        load time =    2472.15 ms
llama_print_timings:      sample time =      10.74 ms /   128 runs   (    0.08 ms per token, 11916.95 tokens per second)
llama_print_timings: prompt eval time =      81.02 ms /     5 tokens (   16.20 ms per token,    61.71 tokens per second)
llama_print_timings:        eval time =    4574.03 ms /   127 runs   (   36.02 ms per token,    27.77 tokens per second)
llama_print_timings:       total time =    4679.75 ms
Log end
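
For reference, the per-token rates follow directly from the raw timings; a quick re-derivation (values copied from llama_print_timings above):

```python
# Values copied from llama_print_timings above.
eval_ms,   eval_runs     = 4574.03, 127
prompt_ms, prompt_tokens = 81.02,   5
sample_ms, sample_runs   = 10.74,   128

print(round(eval_ms / eval_runs, 2))                 # 36.02 ms per decoded token
print(round(eval_runs / (eval_ms / 1000), 2))        # 27.77 tok/s decode
print(round(prompt_tokens / (prompt_ms / 1000), 2))  # 61.71 tok/s prefill
print(round(sample_runs / (sample_ms / 1000), 2))    # ~11918 samples/s (the log's
                                                     # 11916.95 uses unrounded ms)
```

A decode rate of ~27.8 tokens/s with n_threads = 1 is consistent with the bulk of the FFN work running on the GPU after the sparse split.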