llama-cli -m /home/junhui/workspace/llama.cpp/models/Qwen3-1.7B-Q4_K_M.gguf -no-cnv --prompt "What is the result of 1 + 1 in Math?" --no-warmup

Information read from the metadata, llama_model_loader()

llama_model_load_from_file_impl: using device CUDA0 (Orin) - 696 MiB free
loaded meta data with 34 key-value pairs and 311 tensors from ../models/Qwen3-1.7B-Q4_K_M.gguf (version GGUF V3 (latest))
Dumping metadata keys/values. Note: KV overrides do not apply in this output.
- kv   0:                       general.architecture str              = qwen3
- kv   1:                               general.type str              = model
- kv   2:                               general.name str              = Qwen3 1.7B
- kv   3:                           general.basename str              = Qwen3
- kv   4:                         general.size_label str              = 1.7B
- kv   5:                            general.license str              = apache-2.0
- kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-1.7...
- kv   7:                   general.base_model.count u32              = 1
- kv   8:                  general.base_model.0.name str              = Qwen3 1.7B Base
- kv   9:          general.base_model.0.organization str              = Qwen
- kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3-1.7...
- kv  11:                               general.tags arr[str,1]       = ["text-generation"]
- kv  12:                          qwen3.block_count u32              = 28
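- kv  13:                       qwen3.context_length u32              = 40960   # Maximum context length defined in the model metadata:
# the largest number of tokens (input prompt plus generated output) the model was designed and trained to handle. It is fixed in the
# GGUF metadata and cannot be changed at run time; changing it would require retraining.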
- kv  14:                     qwen3.embedding_length u32              = 2048
- kv  15:                  qwen3.feed_forward_length u32              = 6144
- kv  16:                 qwen3.attention.head_count u32              = 16
- kv  17:              qwen3.attention.head_count_kv u32              = 8
- kv  18:                       qwen3.rope.freq_base f32              = 1000000.000000
- kv  19:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
- kv  20:                 qwen3.attention.key_length u32              = 128
- kv  21:               qwen3.attention.value_length u32              = 128
- kv  22:                       tokenizer.ggml.model str              = gpt2
- kv  23:                         tokenizer.ggml.pre str              = qwen2
- kv  24:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
- kv  25:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
- kv  26:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
- kv  27:                tokenizer.ggml.eos_token_id u32              = 151645
- kv  28:            tokenizer.ggml.padding_token_id u32              = 151643
- kv  29:                tokenizer.ggml.bos_token_id u32              = 151643
- kv  30:               tokenizer.ggml.add_bos_token bool             = false
- kv  31:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
- kv  32:               general.quantization_version u32              = 2
- kv  33:                          general.file_type u32              = 15
- type  f32:  113 tensors
- type q4_K:  169 tensors
- type q6_K:   29 tensors

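Model structure information:

The model is based on the Qwen3 architecture and belongs to the Qwen3 series; specifically it is Qwen3 1.7B, nominally 1.7 billion parameters.
The model is tagged for "text-generation".
The model has 28 Transformer layers (block_count = 28), which reflects its depth and compute cost.
The longest supported context is 40960 tokens (context_length = 40960). This bounds the length of the sequence the model can process.
Each token is represented as a 2048-dimensional vector (embedding_length = 2048). This reflects the model's representational capacity and memory requirements, and is also called the hidden-size.
The intermediate dimension of the feed-forward network is feed_forward_length = 6144, which reflects the size of the feed-forward computation.
There are 16 attention heads. They capture different semantic relationships and affect the model's expressiveness and performance.
There are 8 key-value (KV) attention heads. Using fewer KV heads than query heads shrinks the KV cache and makes inference more efficient.
The epsilon used by the normalization layers (RMSNorm) is small, 0.000001 (1e-6). It is there for numerical stability: it keeps the normalization from dividing by zero.
The key and value vectors in the attention mechanism both have dimension 128 per head. This affects the cost of the attention computation and the storage needed for the KV cache.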

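Tokenizer information:

The tokenizer model type is GPT-2-style BPE (type = LLAMA_VOCAB_TYPE_BPE). This is the algorithm that converts text into tokens.
The tokenizer's pre-tokenization follows the Qwen2 logic (pre_type = LLAMA_VOCAB_PRE_TYPE_QWEN2).
The tokenizer vocabulary contains 151,936 tokens. It maps text to token IDs and affects tokenization efficiency and coverage.
Each token in the vocabulary has a token type, 151,936 entries in total, which distinguishes what the token is used for. In llama.cpp, 1 corresponds to LLAMA_TOKEN_TYPE_NORMAL (a normal token: a regular text unit such as a word or subword) and 3 corresponds to LLAMA_TOKEN_TYPE_CONTROL (a control token: a special marker such as <|im_start|> or <|im_end|>). The truncated array printed above shows only 1s, i.e. normal tokens. Qwen3 uses a GPT-2-style BPE tokenizer, so the entries are mainly normal tokens and control tokens.
The merge rules are GPT-2-style BPE merges, for example: ["Ġ Ġ", "ĠĠ ĠĠ", ...]
The end-of-sequence token ID is eos_token_id: u32 = 151645. It marks the end of a generated sequence.
The padding token ID is 151643. It is used to pad sequences so that inputs have the same length.
The beginning-of-sequence (BOS) token ID is the same as the padding token ID (151643).
add_bos_token is false, so for Qwen3 a BOS token is not prepended automatically.
A chat template is provided to guide how the model handles conversations, which suggests the file is meant for instruction-tuned (Instruct-style) use.
Quantization version and file type: file type 15 means Q4_K_M (4-bit K-quant, Medium).
Finally, the model has 311 tensors. Each tensor is a named weight array: 113 are f32, 169 are q4_K and 29 are q6_K, a mix chosen to balance memory use and performance.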

Output of llama_model_loader::print_info()

print_info: file format = GGUF V3 (latest)     
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.19 GiB (5.03 BPW)
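
The BPW (bits per weight) figure can be cross-checked against other numbers in this log. This is a minimal sketch, assuming BPW is simply the weight data size in bits divided by the parameter count. It uses the 1217.35 MiB model buffer size and the 2.03 B parameter count printed further down; the helper name is made up for illustration and is not a llama.cpp function.

```cpp
#include <cassert>
#include <cmath>

// Bits per weight = (size in bytes * 8) / number of parameters.
// Hypothetical helper for illustration only.
double bits_per_weight(double size_mib, double n_params) {
    const double bytes = size_mib * 1024.0 * 1024.0;
    return bytes * 8.0 / n_params;
}
```

Calling bits_per_weight(1217.35, 2.03e9) gives roughly 5.03, which agrees with the value in the log.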

Output of the llama_vocab::impl::load() function

load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 26   # 26 special tokens in total, including the 5 above
load: token to piece cache size = 0.9311 MB  # memory used by the token-to-piece mapping cache

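Inside the function, tokenizer_model can take several values: no_vocab, none, llama, bert, gpt2, t5, rwkv, plamo2. tokenizer_pre has many more (over 70); my model uses Qwen2. The GGUF metadata contains these values, and the code branches on them. Specifically, it reads the tokenizer-related information from the GGUF metadata (the LLM_KV structure) and, depending on tokenizer_model and tokenizer_pre, assigns the corresponding IDs to the special tokens.

The llama_vocab::impl::load() function loads the tokenizer vocabulary, parses the special tokens, and keeps them in memory.

Purpose: the special tokens and the token-to-piece mapping are cached in dedicated in-memory structures. They are used frequently during inference (for example, checking for <|im_end|> to stop generation), so caching them avoids re-reading the vocabulary from the GGUF file. More concretely, the caches speed up tokenization and detokenization: converting input text to token IDs (encoding) and token IDs back to text (decoding) become fast lookups.

Here "token" presumably means the token ID, and "piece" the token in string form. The implementation is in llama_vocab::token_to_piece.

The 5 entries printed are all special tokens that signal EOG (End Of Generation). They are control tokens, used to stop generation and to mark conversation boundaries. Together with the other special tokens they make up the 26 special tokens. The function is generic: it has to support many models and tokenizer models (tokenization algorithms), each with its own tokenization rules and special tokens.

151643 ('<|endoftext|>'): end-of-text token ID. It is the same as tokenizer.ggml.bos_token_id = 151643 and padding_token_id = 151643, so it also serves as the sequence-start and padding token.
151645 ('<|im_end|>'): end-of-turn token ID. It matches tokenizer.ggml.eos_token_id = 151645 and is used to end generation (as in Qwen3's chat template <|im_start|>... <|im_end|>).
151662 ('<|fim_pad|>'): fill-in-the-middle (FIM) padding token ID, used for code-completion tasks (as in models such as StarCoder).
151663 ('<|repo_name|>'): token ID that marks a code repository name, probably used for code-generation tasks.
151664 ('<|file_sep|>'): file separator token ID, which separates multiple files or contexts.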

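Output of model.print_info(); in llama_model_load():

This information comes from the hparams member of the llama_model object. Like load(), print_info() has to support many model architectures and tokenizer models, so its implementation is also a long series of branches.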

print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 2048
print_info: n_layer          = 28
print_info: n_head           = 16
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: model type       = 1.7B
print_info: model params     = 2.03 B
print_info: general.name     = Qwen3 1.7B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256

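load_tensors(ml): loading the weight tensors

Another member of the llama_model object is params (distinct from hparams). The part of load_tensors() that creates the weight tensors covers dozens of architectures; for each arch it defines the number of layers and the weight tensors each layer needs. For example, Qwen3:

Key code location: llama-model.cpp, in load_tensors, at case LLM_ARCH_QWEN3: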

case LLM_ARCH_QWEN3:
    {
        tok_embd = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, 0);

        // output
        output_norm = create_tensor(tn(LLM_TENSOR_OUTPUT_NORM, "weight"), {n_embd}, 0);
        output      = create_tensor(tn(LLM_TENSOR_OUTPUT,      "weight"), {n_embd, n_vocab}, TENSOR_NOT_REQUIRED);
        // if output is NULL, init from the input tok embed
        if (output == NULL) {
            output = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, TENSOR_DUPLICATED);
        }

        for (int i = 0; i < n_layer; ++i) {
            auto & layer = layers[i];

            layer.attn_norm = create_tensor(tn(LLM_TENSOR_ATTN_NORM, "weight", i), {n_embd}, 0);

            layer.wq = create_tensor(tn(LLM_TENSOR_ATTN_Q,   "weight", i), {n_embd, n_embd_head_k * n_head}, 0);
            layer.wk = create_tensor(tn(LLM_TENSOR_ATTN_K,   "weight", i), {n_embd, n_embd_gqa}, 0);
            layer.wv = create_tensor(tn(LLM_TENSOR_ATTN_V,   "weight", i), {n_embd, n_embd_gqa}, 0);
            layer.wo = create_tensor(tn(LLM_TENSOR_ATTN_OUT, "weight", i), {n_embd_head_k * n_head, n_embd}, 0);

            layer.attn_k_norm = create_tensor(tn(LLM_TENSOR_ATTN_K_NORM, "weight", i), {n_embd_head_k}, 0);
            layer.attn_q_norm = create_tensor(tn(LLM_TENSOR_ATTN_Q_NORM, "weight", i), {n_embd_head_k}, 0);

            layer.ffn_norm = create_tensor(tn(LLM_TENSOR_FFN_NORM, "weight", i), {n_embd}, 0);
            layer.ffn_gate = create_tensor(tn(LLM_TENSOR_FFN_GATE, "weight", i), {n_embd,   n_ff}, 0);
            layer.ffn_down = create_tensor(tn(LLM_TENSOR_FFN_DOWN, "weight", i), {  n_ff, n_embd}, 0);
            layer.ffn_up   = create_tensor(tn(LLM_TENSOR_FFN_UP,   "weight", i), {n_embd,   n_ff}, 0);
        }
    } break;

The model has 311 tensors, with internal names such as "blk.27.ffn_down.weight" and "blk.27.attn_v.weight". This is recorded in the pimpl member of the llama_model object, where n_objects = 311.
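
The tensor count and the parameter count can both be derived from the case LLM_ARCH_QWEN3 branch above together with the hyperparameters in the print_info log. The sketch below is my own arithmetic, not llama.cpp code; the struct and function names are hypothetical. It assumes this GGUF file stores a separate output.weight, which is what the totals (311 tensors, 2.03 B parameters) are consistent with.

```cpp
#include <cassert>
#include <cstdint>

// Hyperparameters as printed by print_info for this Qwen3 1.7B file.
struct HParams {
    int64_t n_vocab       = 151936;
    int64_t n_embd        = 2048;
    int64_t n_layer       = 28;
    int64_t n_head        = 16;
    int64_t n_head_kv     = 8;
    int64_t n_embd_head_k = 128;
    int64_t n_ff          = 6144;
};

// 11 tensors per layer, plus token_embd, output_norm and output
// (output is assumed to be present as its own tensor in this file).
int64_t count_tensors(const HParams & hp) {
    const int64_t per_layer = 11;
    return hp.n_layer * per_layer + 3;
}

// Sum of the element counts of all tensors created in the branch.
int64_t count_params(const HParams & hp) {
    const int64_t n_embd_gqa = hp.n_embd_head_k * hp.n_head_kv;  // 1024
    const int64_t n_q        = hp.n_embd_head_k * hp.n_head;     // 2048

    int64_t per_layer = 0;
    per_layer += hp.n_embd;                    // attn_norm
    per_layer += hp.n_embd * n_q;              // wq
    per_layer += 2 * hp.n_embd * n_embd_gqa;   // wk, wv
    per_layer += n_q * hp.n_embd;              // wo
    per_layer += 2 * hp.n_embd_head_k;         // attn_k_norm, attn_q_norm
    per_layer += hp.n_embd;                    // ffn_norm
    per_layer += 3 * hp.n_embd * hp.n_ff;      // ffn_gate, ffn_down, ffn_up

    int64_t total = hp.n_layer * per_layer;
    total += 2 * hp.n_embd * hp.n_vocab;       // token_embd + output
    total += hp.n_embd;                        // output_norm
    return total;
}
```

With the default values, count_tensors gives 311 and count_params gives 2,031,739,904, which rounds to the 2.03 B reported by print_info. The 113 f32 tensors also fit this picture: 4 norm tensors per layer times 28 layers, plus output_norm.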

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1217.35 MiB

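Creating the llama_context object

Key code location: in llama-context.cpp, function llama_init_from_model, the line auto * ctx = new llama_context(*model, params);

Loading the model and its tensors produces static content that is kept around, so it does not have to be reloaded for every inference. Once the model and tensors are loaded, the context is created. The context is the core data structure that manages run-time inference state, and it is kept separate from the static model weights (llama_model).

Model loading is a global, one-time operation; a llama_context is comparatively lightweight (its KV cache is a few hundred MB).

llama_model is static: it contains only the architecture and the parameters, and cannot process dynamic input or generate sequences. A context is needed to run the actual inference (the forward pass also lives in the context class: llama_context::decode). The context's responsibilities include:

Configuring the inference parameters (cparams) dynamically, independent of the model architecture; for example n_ctx, the maximum context length (4096 in this run, versus n_ctx_train = 40960).
Initializing output_ids (to -1).
Initializing hardware resources: it sets up the backends (ggml_backend_cuda_guid()::guid and ggml_backend_cpu_guid()::guid) and computes the CPU/GPU buffer sizes: CPU output buffer size = 0.58 MiB, CUDA0 compute buffer size = 544.18 MiB.
Supporting tokenization and sequence generation. Both need dynamic state (e.g. the current token position and the KV cache), whereas llama_model only provides the static vocabulary.
Acting as an independent inference session: each llama_context has its own KV cache, buffers and configuration (params.n_ctx, n_batch). This lets several inference sessions share one model.
Defining the compute path through the Transformer layers.

Call stack (innermost first):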

auto * ctx = new llama_context(*model, params);
llama_context * lctx = llama_init_from_model(model, cparams);
common_init_from_params(params);
main();

Log output:

llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096   # actual context length used for inference, set with the -c / --ctx-size command-line option
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache_unified:        CPU KV buffer size =   448.00 MiB
llama_kv_cache_unified: size =  448.00 MiB (  4096 cells,  28 layers,  1/1 seqs), K (f16):  224.00 MiB, V (f16):  224.00 MiB
llama_context:      CUDA0 compute buffer size =   544.18 MiB
llama_context:  CUDA_Host compute buffer size =    20.01 MiB
llama_context: graph nodes  = 1098
llama_context: graph splits = 367 (with bs=512), 1 (with bs=1)
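
The 448 MiB KV buffer in this log can be reproduced by hand. This is a minimal sketch, assuming the cache holds one K and one V vector of width n_embd_k_gqa = n_embd_v_gqa = 1024 per cell and per layer, stored as f16 (2 bytes per element); the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Total KV cache size in bytes: K and V each store
// n_ctx * n_layer * n_embd_kv_gqa elements.
int64_t kv_cache_bytes(int64_t n_ctx, int64_t n_layer,
                       int64_t n_embd_kv_gqa, int64_t bytes_per_elem) {
    const int64_t k_bytes = n_ctx * n_layer * n_embd_kv_gqa * bytes_per_elem;
    return 2 * k_bytes;  // K and V have the same size here
}
```

kv_cache_bytes(4096, 28, 1024, 2) is 448 MiB (224 MiB each for K and V), matching the log. Under the same assumptions, running with the full n_ctx_train of 40960 would need ten times as much, 4480 MiB.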

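So why is a context needed?

llama_model is static: it holds the weights (311 tensors) and the metadata (qwen3.context_length = 40960), but it cannot handle dynamic input or the generation process.
llama_context is dynamic: it manages the inference state (KV cache, token sequence).

Setting the bias for special tokens

During generation, a logit bias adjusts the probability of producing a token. A bias of -inf sets the probability of these tokens to 0, which forbids them entirely. Presumably these biases only take effect when EOG tokens are meant to be ignored (e.g. with --ignore-eos); otherwise generation could never stop at an EOG token.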

common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
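
To see why a bias of -inf removes a token completely, here is a small self-contained sketch. It is not llama.cpp's logit-bias sampler, just a plain softmax with a per-token bias added first; the function name is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Add a per-token bias to the logits, then turn them into probabilities.
// Assumes at least one token keeps a finite logit.
std::vector<double> softmax_with_bias(std::vector<double> logits,
                                      const std::vector<double> & bias) {
    double max_logit = -INFINITY;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        logits[i] += bias[i];
        max_logit = std::max(max_logit, logits[i]);
    }
    double sum = 0.0;
    for (double & x : logits) {
        x = std::exp(x - max_logit);  // exp(-inf) == 0 for biased-out tokens
        sum += x;
    }
    for (double & x : logits) {
        x /= sum;
    }
    return logits;
}
```

With logits {1, 2, 3} and bias {0, -inf, 0}, the second token ends up with probability exactly 0 and the remaining probability mass is shared between the other two.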

System information and sampler parameters

In main(), smpl = common_sampler_init(model, sparams); adds the individual samplers via llama_sampler_chain_add.

main: llama threadpool init, n_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

sampler seed: 840489643
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

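The sampler chain line describes the order in which llama.cpp applies samplers when generating a token, i.e. the processing chain from the raw model output (logits) to the final token choice.

logit-bias applies a bias to the logits of specific tokens, raising or lowering their probability.
penalties applies penalties (e.g. frequency penalty, repetition penalty) to lower the probability of repeated tokens.
DRY is "Don't Repeat Yourself" sampling, which reduces repeated patterns.
top-n-sigma keeps the tokens whose logits lie within n standard deviations (sigma) of the maximum logit, limiting the candidate set.
top-k sampling keeps the K most probable tokens.
typical sampling selects tokens based on information entropy.
top-p sampling keeps the smallest set of tokens whose cumulative probability reaches P.
min-p sampling is minimum-probability sampling: it filters out tokens whose probability falls below a threshold relative to the most likely token.
XTC is "Exclude Top Choices" sampling, which can drop the most likely candidates to increase variety.
temp-ext is the extended temperature sampler, which adjusts how flat or peaked the distribution is.
dist is the final step: the logits are normalized into probabilities (softmax) and a token is sampled from that distribution.

Each sampler filters or adjusts the distribution step by step, which affects the diversity, coherence and quality of the generated text. Samplers that are not configured explicitly use their default values.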
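
To make the "filter step by step" idea concrete, here is a simplified sketch of four of the stages (top-k, top-p, min-p, temperature) on a toy candidate list. It is my own illustration under simplifying assumptions, not llama.cpp's implementation: the Candidate struct and all function names are hypothetical, and details such as min_keep are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Candidate {
    int   id;     // token ID
    float logit;  // raw score from the model
    float p;      // probability, filled in by softmax()
};

// Sort by logit (descending) and compute probabilities.
// Assumes the candidate list is not empty.
void softmax(std::vector<Candidate> & c) {
    std::sort(c.begin(), c.end(),
              [](const Candidate & a, const Candidate & b) { return a.logit > b.logit; });
    const float max_logit = c.front().logit;
    float sum = 0.0f;
    for (auto & x : c) { x.p = std::exp(x.logit - max_logit); sum += x.p; }
    for (auto & x : c) { x.p /= sum; }
}

// top-k: keep the k most likely candidates.
void top_k(std::vector<Candidate> & c, std::size_t k) {
    softmax(c);
    if (k < c.size()) c.resize(k);
}

// top-p: keep the smallest prefix whose cumulative probability reaches p.
void top_p(std::vector<Candidate> & c, float p) {
    softmax(c);
    float cum = 0.0f;
    std::size_t keep = c.size();
    for (std::size_t i = 0; i < c.size(); ++i) {
        cum += c[i].p;
        if (cum >= p) { keep = i + 1; break; }
    }
    c.resize(keep);
}

// min-p: drop candidates whose probability is below min_p_value * p_max.
void min_p(std::vector<Candidate> & c, float min_p_value) {
    softmax(c);
    const float threshold = min_p_value * c.front().p;
    std::size_t keep = 0;
    while (keep < c.size() && c[keep].p >= threshold) ++keep;
    c.resize(keep);
}

// temperature: divide logits by temp (> 0); values below 1 sharpen
// the distribution, values above 1 flatten it.
void temperature(std::vector<Candidate> & c, float temp) {
    for (auto & x : c) x.logit /= temp;
}
```

Each stage only ever shrinks or rescales the candidate list, so the order matters: a filter placed early decides what the later ones get to see. The final dist stage would then draw one token at random from whatever is left, weighted by p.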

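The sampler params: block lists the actual parameter values of each sampler. n_predict = -1 means there is no limit on the number of generated tokens. For more sampling parameters, see llama-cli --help.

Creating a sampler chain looks roughly like this: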

    llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_min_p(0.05f, 1));
    llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8f));
    llama_sampler_chain_add(smpl, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

Key location for token generation

Tokens are generated one at a time, in a while loop.

while ((n_remain != 0 && !is_antiprompt) || params.interactive) {}
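
The loop condition also explains why n_predict = -1 means "no limit": a counter that starts at -1 and is decremented never equals 0. Below is a toy version of the loop under simplifying assumptions (no interactive mode, no antiprompt). The generate function and its next_token callback are hypothetical stand-ins for "decode, then sample"; they are not llama.cpp APIs.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <vector>

// Toy generation loop. next_token stands in for "decode + sample";
// eog is the set of end-of-generation token IDs.
std::vector<int> generate(const std::function<int(const std::vector<int> &)> & next_token,
                          std::vector<int> tokens, int n_predict,
                          const std::set<int> & eog) {
    int n_remain = n_predict;          // -1 never reaches 0, i.e. no limit
    while (n_remain != 0) {
        const int tok = next_token(tokens);
        if (eog.count(tok) > 0) break; // stop on an EOG token
        tokens.push_back(tok);
        --n_remain;
    }
    return tokens;
}
```

With a positive n_predict the loop stops after that many tokens; with -1 it only stops when an EOG token such as <|im_end|> is produced.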

KAQ

KAQ: What does CPU_Mapped model buffer size mean?

Key code location: llama-model.cpp, in function load_tensors:

LLAMA_LOG_INFO("%s: %12s model buffer size = %8.2f MiB\n", __func__, ggml_backend_buffer_name(buf.get()), ggml_backend_buffer_get_size(buf.get()) / 1024.0 / 1024.0);

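It is the total size of the CPU-mapped buffer allocated when the tensors are loaded from the GGUF file. This buffer holds the model weights and is mapped into the CPU address space (the log shows mmap = true).

KAQ: Can `n_ctx` be adjusted dynamically at run time?

n_ctx is the maximum context length (in tokens) supported during inference, i.e. the total number of input and output tokens the model can handle at once. It cannot be changed while running, but it can be set at startup, for example with --ctx-size on the CLI: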

./bin/llama-cli -m ../models/Qwen3-1.7B-Q4_K_M.gguf -no-cnv --prompt "what is the result of 1 + 1 in Math?" --ctx-size 2048

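The log then shows n_ctx = 2048 for the context, which confirms that llama_context parameters can be adjusted at startup.

KAQ: How long is the user's prompt input sequence?

KAQ: What is hidden-size for?

Tokenization assigns an ID to each token. hidden-size is the length of the vector that each token ID is mapped to by the embedding. This vector is the token's starting representation; the Transformer layers then refine it with information about the surrounding context and position, so it gives them a rich semantic basis to work from.

vocab-size and hidden-size together give the shape of the embedding matrix $W$ (vocab-size × hidden-size, i.e. 151936 × 2048 for this model).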
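
As an illustration of how the embedding matrix is used, the sketch below stores $W$ row-major and returns the row that belongs to a token ID. It is a toy model of the lookup, not ggml's implementation; the Embedding struct is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Embedding matrix W with n_vocab rows and n_embd (hidden-size) columns,
// stored row-major in one flat vector.
struct Embedding {
    std::size_t n_vocab;
    std::size_t n_embd;
    std::vector<float> w;  // n_vocab * n_embd values

    // Return the n_embd-long vector for one token ID.
    std::vector<float> lookup(std::size_t token_id) const {
        const float * row = w.data() + token_id * n_embd;
        return std::vector<float>(row, row + n_embd);
    }
};
```

For this model the real matrix would have 151936 rows of 2048 values each, so every token ID selects one 2048-dimensional vector.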

KAQ: How do attention heads relate to KV attention heads?
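
The print_info log gives a hint: n_head = 16, n_head_kv = 8, n_gqa = 2 and n_embd_k_gqa = 1024. That pattern matches grouped-query attention, where several query heads share one KV head. The sketch below shows the usual convention (consecutive query heads form a group); the function names are made up, and the grouping order is an assumption rather than something read from the llama.cpp source.

```cpp
#include <cassert>

// Number of query heads that share one KV head (n_gqa in the log).
int n_gqa(int n_head, int n_head_kv) {
    return n_head / n_head_kv;
}

// KV head used by a given query head, assuming consecutive
// query heads are grouped together.
int kv_head_for(int q_head, int n_head, int n_head_kv) {
    return q_head / n_gqa(n_head, n_head_kv);
}

// Width of the K (or V) projection: head size times number of KV heads.
int n_embd_k_gqa(int n_embd_head_k, int n_head_kv) {
    return n_embd_head_k * n_head_kv;
}
```

For this model, query heads 0 and 1 would share KV head 0, and query heads 14 and 15 would share KV head 7. Only 8 K/V heads have to be cached instead of 16, which halves the KV cache compared with giving every query head its own K and V.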

KAQ: How to enable LOG_DBG?

LOG_DBG: LOG_INF: enabled by default LOG_CNT:
