llama.cpp CLI

@ Core inference-related content in the `llama.cpp` project

Start from the tool entry point, llama-cli. It is the front door of the llama.cpp project and the best place to begin.

Searching for GGUF files

Get a GGUF file: https://huggingface.co/unsloth/Qwen3-1.7B-GGUF. The HF models page does not appear to list .gguf files, so how was this one found? They are in fact there; filter the models page with library=gguf:

https://huggingface.co/models ==> https://huggingface.co/models?library=gguf

Alternatively, the ggml-org/gguf-my-repo tool can convert model weights to GGUF format.

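Once you have found a GGUF file, open the repository's Files and versions tab and click the arrow next to the .gguf file to inspect it online: model metadata, model parameters, tokenizer parameters, and Tensors, that is, the shape and precision of every tensor in every layer of the architecture.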
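The same top-level information can be checked locally. The following is a minimal sketch, not llama.cpp code: it reads only the fixed-size GGUF header, assuming the layout of GGUF version 2 and later (4-byte magic "GGUF", a uint32 version, then uint64 tensor and metadata key/value counts) and a little-endian host. The names GgufHeader and read_gguf_header are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <optional>
#include <string>

// Fixed-size fields at the start of a GGUF file (assumed: GGUF v2 or later).
struct GgufHeader {
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
};

// Hypothetical helper: returns the header, or std::nullopt if the file
// cannot be read or does not start with the "GGUF" magic.
std::optional<GgufHeader> read_gguf_header(const std::string & path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return std::nullopt;
    }

    char magic[4];
    GgufHeader h{};
    in.read(magic, sizeof magic);
    // Fields are read one by one to avoid struct padding issues.
    in.read(reinterpret_cast<char *>(&h.version), sizeof h.version);
    in.read(reinterpret_cast<char *>(&h.tensor_count), sizeof h.tensor_count);
    in.read(reinterpret_cast<char *>(&h.metadata_kv_count), sizeof h.metadata_kv_count);

    if (!in || std::memcmp(magic, "GGUF", 4) != 0) {
        return std::nullopt;
    }
    return h;
}
```

Calling read_gguf_header on a downloaded .gguf file and printing the three fields gives a quick sanity check that the file is complete enough to have a valid header. The metadata key/value pairs and tensor descriptors that follow the header are not parsed here.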

Converting a model to GGUF format

Use the tool that ships with llama.cpp, convert_hf_to_gguf.py. For the full procedure, see llama.cpp/tools/quantize/README.md.

@ Find a quantized model that can be loaded on a 4GB Jetson Orin device and run it with llama-cli

Target: Qwen3-1.7B-Q4_K_M.gguf, 1.2GB in size. Put the file in the llama.cpp/models directory.
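On a 4GB device the weights are not the only consumer of memory; the KV cache has to fit as well. The following is a rough sketch for budgeting it, assuming standard attention that stores one K and one V vector per layer, per position, and per KV head. The function name kv_cache_bytes is illustrative and not part of llama.cpp, and compute buffers are not counted.

```cpp
#include <cstdint>

// Rough KV cache size in bytes (illustrative helper, not a llama.cpp API).
// The factor 2 accounts for storing both K and V.
uint64_t kv_cache_bytes(uint64_t n_layer,
                        uint64_t n_ctx,
                        uint64_t n_head_kv,
                        uint64_t head_dim,
                        uint64_t bytes_per_elem) {
    return 2 * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_elem;
}
```

As an example, kv_cache_bytes(28, 4096, 8, 128, 2) gives 469762048 bytes, or 448 MiB, for an f16 cache at a 4096-token context. The values 28 layers, 8 KV heads, and head dimension 128 are my assumption for Qwen3-1.7B and should be checked against the GGUF metadata. Added to the 1.2GB file, that leaves headroom on a 4GB device, and a smaller context shrinks the cache proportionally.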

Non-interactive mode:

llama-cli -m ../models/Qwen3-1.7B-Q4_K_M.gguf -no-cnv --prompt "Hello, tell me something about llama.cpp"

Start interactive mode:

llama-cli -m ../models/Qwen3-1.7B-Q4_K_M.gguf

Configuring the GDB launch file

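A debug build is required (for example, a CMake build configured with CMAKE_BUILD_TYPE=Debug); otherwise the program cannot be debugged at the source level. The configuration below expects the binary under build-debug/bin.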

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "(gdb) Launch",
            "type": "cppdbg",
            "request": "launch",
            "program": "<workspace>/llama.cpp/build-debug/bin/llama-cli",
            "args": ["-m", "<workspace>/llama.cpp/models/Qwen3-1.7B-Q4_K_M.gguf", 
                     "-no-cnv", "--prompt", "What is the result of 1 + 1 in Math?", 
                     "--no-warmup"],
            "stopAtEntry": true,
            "cwd": "${fileDirname}",
            "environment": [],
            "externalConsole": false,
            "MIMode": "gdb",
            "setupCommands": [
                {
                    "description": "Enable pretty-printing for gdb",
                    "text": "-enable-pretty-printing",
                    "ignoreFailures": true
                }
            ]
        }
    ]
}

Error in debug mode

In debug mode the following assertion fails: assert(node->buffer->buft == ggml_backend_cuda_buffer_type(cuda_ctx->device));

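The GGML CUDA backend in llama.cpp assumes that the buffer of each node (node->buffer) has the buffer type of the current CUDA device (cuda_ctx->device). On my Jetson Orin this does not hold. A plain assert is compiled out when NDEBUG is defined, which is why the failure only shows up in the debug build. Workaround: comment out the assert.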

done

The inference process in code

The big picture of the pipeline mainly involves the following steps. [TODO: add build_graph and graph compute]

main() {

    // load model and adapt Lora
    common_init_result llama_init = common_init_from_params(params); {
        // load model
        llama_model * model = llama_model_load_from_file(params.model.path.c_str(), mparams); 
        {
            // print KV pairs and tensor types
            const int status = llama_model_load(path_model, splits, *model, params);
            {
                // construct the model loader object
                llama_model_loader ml(...)
                // load vocab
                model.load_vocab(ml);
                // print info
                model.print_info();
                // load tensors and report the CPU-mapped model buffer size
                model.load_tensors(ml)
            }
        }
        // load vocab
        const llama_vocab * vocab = llama_model_get_vocab(model);

        auto cparams = common_context_params_to_llama(params);
        // initialize the llama context
        llama_context * lctx = llama_init_from_model(model, cparams);

        common_set_adapter_lora(lctx, params.lora_adapters);
    }

    model = llama_init.model.get();
    ctx = llama_init.context.get();

    auto * mem = llama_get_memory(ctx);
    const llama_vocab * vocab = llama_model_get_vocab(model);
    auto chat_templates = common_chat_templates_init(model, params.chat_template);

    // thread pool setup
    // ...
    // system info
    common_params_get_system_info(params).c_str()

    // tokenize the prompt input
    prompt = params.prompt;
    embd_inp = common_tokenize(ctx, prompt, true, true);

    // sampler: the individual samplers are added via llama_sampler_chain_add
    smpl = common_sampler_init(model, sparams); {
        const llama_vocab * vocab = llama_model_get_vocab(model);
        auto * result = new common_sampler {... llama_sampler_chain_init(lparams),...}
        llama_sampler_chain_add(result->chain...)
        llama_sampler_chain_add(result->chain...)
    }

    // generate tokens one at a time
    while ((n_remain != 0 && !is_antiprompt) || params.interactive) {
        ....
        // the actual forward-pass computation; one forward pass is needed before each generated token
        llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval))
        ....
    }

    // clean up
}
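The core of the loop above, one forward pass followed by one sampling step per generated token, can be sketched without llama.cpp. This is a toy illustration under stated assumptions: Token, Logits, ForwardFn, sample_greedy, and generate are made-up names, the forward function stands in for llama_decode, and a single greedy sampler stands in for the sampler chain. Unlike llama-cli, which keeps a KV cache and evaluates only the new tokens, the toy passes the whole history to every forward call.

```cpp
#include <algorithm>
#include <functional>
#include <iterator>
#include <vector>

using Token  = int;
using Logits = std::vector<float>;

// Hypothetical stand-in for the forward pass: token history in, logits out.
using ForwardFn = std::function<Logits(const std::vector<Token> &)>;

// Greedy sampling: pick the index of the largest logit.
Token sample_greedy(const Logits & logits) {
    auto best = std::max_element(logits.begin(), logits.end());
    return static_cast<Token>(std::distance(logits.begin(), best));
}

// Generate up to n_predict tokens after the prompt, stopping early at eos.
std::vector<Token> generate(const ForwardFn & forward,
                            std::vector<Token> tokens,
                            int n_predict,
                            Token eos) {
    int n_remain = n_predict;
    while (n_remain != 0) {
        // One forward pass is needed before each generated token.
        const Logits logits = forward(tokens);
        const Token next = sample_greedy(logits);
        if (next == eos) {
            break; // end of sequence: stop without appending it
        }
        tokens.push_back(next);
        --n_remain;
    }
    return tokens;
}
```

With a toy forward function over a 5-token vocabulary that always favors the token after the last one, starting from the prompt {0} with eos set to 4 produces 1, 2, 3 and then stops. The n_remain counter plays the same role as in the llama-cli loop condition.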

Tool: visualizing the model structure

Add the following code in the build_graph step:

    {
        static int call_count = 0;
        if (call_count == 0) { // dump only on the first call
            ggml_graph_dump_dot(gf, NULL, "llama-after-graph-compute.dot");
        }
        call_count++;
    }

Use Graphviz to view the generated graph. Install the dependency:

sudo apt update
sudo apt install graphviz

Convert the .dot file to an image:

dot -Tpng llama-after-graph-compute.dot -o llama-after-graph-compute.png

Or convert it to SVG, which suits complex computation graphs better:

dot -Tsvg llama-after-graph-compute.dot -o llama-after-graph-compute.svg

The file is too large and the graph too complex; it always fails to load.
