LLM Parameters

This is an introduction to LLM parameters.
#

Inference Parameters
#

Temperature (Default: 1, Range: 0 to 2)

Temperature rescales the model’s token probabilities before sampling. A higher temperature flattens the distribution and makes the output more varied and creative, while a lower temperature concentrates probability on the most likely tokens and keeps the output consistent.

For example: when the temperature is set to 1.0, the LLM’s responses tend to be more adventurous; at 0.1, they are more conservative and close to deterministic.
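
To make the effect concrete, here is a minimal sketch of temperature scaling applied to a softmax. The function name and the three example logits are made up for illustration; real inference libraries do the same thing on tensors.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.1)  # nearly all mass on token 0
flat = softmax_with_temperature(logits, temperature=2.0)   # mass spread more evenly
```

With the low temperature almost all probability lands on the top token, while the high temperature leaves the other tokens a realistic chance of being sampled.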

Max Generation Length (Default: model-dependent, Range: 1 to the model’s maximum)

Setting the maximum generation length caps the number of tokens the LLM may produce in its output. It does not change how much input context the model reads, although in most models the prompt and the generated tokens must fit in the same context window.

Top-K (Default: 0, which usually means disabled; Range: 1 to vocabulary size; Common values: 40, 50)

Sampling is restricted to the top k highest-probability words. For instance, with k set to 50, the selection is made from among the 50 most likely words.
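
A small sketch of top-k sampling over an already-normalized probability list follows. The helper name `top_k_sample` and the example probabilities are assumptions for illustration, and treating `k <= 0` as "disabled" is one common convention rather than a universal rule.

```python
import random

def top_k_sample(probs, k, rng=random):
    """Sample a token index from the k highest-probability entries of probs."""
    if k <= 0 or k >= len(probs):
        candidates = list(range(len(probs)))  # top-k disabled: keep every token
    else:
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        candidates = ranked[:k]
    # random.choices accepts relative weights, so no explicit renormalization is needed
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution over four tokens
rng = random.Random(0)
samples = [top_k_sample(probs, 2, rng) for _ in range(100)]  # only tokens 0 and 1 can appear
```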

Top-p (Default: 1, Range: 0 to 1, Common values: 0.9, 0.95)

Sampling is restricted to the smallest set of highest-probability words whose cumulative probability reaches p. For example, with p set to 0.9, the selection is made from among the most likely words whose combined probabilities account for 90% of the total.
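
The selection step of top-p (nucleus) sampling can be sketched as below. The function name `top_p_filter` and the probabilities are illustrative assumptions; sampling would then happen only among the returned indices.

```python
def top_p_filter(probs, p):
    """Return indices of the smallest high-probability set whose total reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break  # the nucleus is complete; everything after this is discarded
    return kept

probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution over four tokens
nucleus = top_p_filter(probs, p=0.9)  # 0.5 + 0.3 + 0.15 is the first total to reach 0.9
```

Unlike top-k, the size of the candidate set adapts to the distribution: a confident model keeps few tokens, an uncertain one keeps many.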

Repetition Penalty (Default: 1, Range: 1.0 to 2.0)

To reduce the likelihood of generating repetitive words, a higher repetition penalty value can be applied. It lowers the scores of tokens that have already been generated, which reduces redundant content in the text. A value of 1 applies no penalty.
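
One common formulation, introduced with the CTRL model and also used by Hugging Face Transformers, divides positive logits of already-seen tokens by the penalty and multiplies negative ones by it. The sketch below assumes that formulation; other libraries may implement the penalty differently, and the function name is hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the score of every token id that already appears in generated_ids."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty  # shrink a positive score toward zero
        else:
            adjusted[token_id] *= penalty  # push a negative score further down
    return adjusted

logits = [2.0, -1.0, 0.5]  # made-up scores for token ids 0, 1, 2
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1, 0], penalty=2.0)
```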

Beam Search (Default: 1, Range: 1 to 10)

Beam Search is a heuristic search algorithm that maintains multiple candidate sequences during generation and finally selects the highest-scoring one as the output. Its key parameter is the Beam Width, which determines the number of candidates retained at each step. A width of 1 is equivalent to greedy decoding.
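
The following toy sketch shows the idea on a tiny hand-written "model". `toy_model` is a hypothetical stand-in that returns next-token probabilities; it is not a real LLM, and production beam search adds batching, length normalization, and early-stopping rules that are omitted here.

```python
import math

def beam_search(next_token_probs, beam_width, max_len, eos="<eos>"):
    """Keep the beam_width best partial sequences by total log-probability."""
    beams = [([], 0.0)]  # each beam is (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished: carry forward unchanged
                continue
            for token, prob in next_token_probs(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t and t[-1] == eos for t, _ in beams):
            break
    return beams

def toy_model(tokens):
    """Hypothetical next-token distribution, hard-coded for the example."""
    if not tokens:
        return {"the": 0.6, "a": 0.4}
    if tokens[-1] == "the":
        return {"cat": 0.5, "dog": 0.5}
    if tokens[-1] == "a":
        return {"cat": 0.9, "dog": 0.1}
    return {"<eos>": 1.0}

best_tokens, best_score = beam_search(toy_model, beam_width=2, max_len=3)[0]
greedy_tokens, greedy_score = beam_search(toy_model, beam_width=1, max_len=3)[0]
```

With a width of 1 the search commits to "the" because it is the likeliest first token, while a width of 2 keeps "a" alive long enough to discover that "a cat" is the more probable sequence overall.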

Length Penalty

Length Penalty is a setting, most often used together with beam search, that helps the model create text that’s not too short or too long. It works by adjusting the score of each candidate text based on its length, so that candidates of different lengths can be compared fairly.

Example: Imagine you’re asking a model to write a summary of a book. Without Length Penalty, the summary might be too short and miss important points, or it might be too long and include unnecessary details. By tuning Length Penalty, you can shift the preference toward shorter or longer summaries until the output includes the key information without extra fluff.
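
As a rough sketch, one widely used convention (the one in Hugging Face Transformers beam search) divides a candidate’s summed log-probability by its length raised to the penalty. Other toolkits use different formulas, so treat the function below as an assumption-laden illustration rather than a universal definition. Because log-probabilities are negative, a larger exponent favors longer candidates under this convention.

```python
def length_normalized_score(sum_logprob, length, length_penalty=1.0):
    """Score a candidate as its summed log-probability divided by length ** penalty."""
    return sum_logprob / (length ** length_penalty)

# Two made-up candidates: a short one and a longer one with lower total log-probability.
short = {"sum_logprob": -2.0, "length": 2}
long_ = {"sum_logprob": -3.0, "length": 6}

# With penalty 0 the raw totals are compared, so the short candidate wins.
raw_short = length_normalized_score(short["sum_logprob"], short["length"], 0.0)
raw_long = length_normalized_score(long_["sum_logprob"], long_["length"], 0.0)

# With penalty 1 the per-token average is compared, so the long candidate wins.
norm_short = length_normalized_score(short["sum_logprob"], short["length"], 1.0)
norm_long = length_normalized_score(long_["sum_logprob"], long_["length"], 1.0)
```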

frequency_penalty (Default: 0, Range: -2.0 to 2.0)

Description: Penalizes tokens in proportion to how many times they have already appeared in the generated text, reducing repetition. Example: When set to 1.0, a token becomes less likely each additional time it is used, resulting in more diverse text.

presence_penalty (Default: 0, Range: -2.0 to 2.0)

Description: Applies a fixed, one-time penalty to any token that has already appeared in the text, no matter how often, encouraging the use of new words. Example: When set to 1.5, the model tends to use words that haven’t appeared yet, increasing text diversity.
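
The difference between the two penalties is easiest to see side by side. The sketch below follows the adjustment described in the OpenAI API documentation, where each logit is reduced by `count * frequency_penalty` plus `presence_penalty` if the token has appeared at all. The function name and the example values are illustrative assumptions.

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Subtract count-based and presence-based penalties from the logits of seen tokens."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        # frequency_penalty grows with each repeat; presence_penalty is applied once
        adjusted[token_id] -= count * frequency_penalty + presence_penalty
    return adjusted

logits = [1.0, 1.0, 1.0]   # made-up scores for token ids 0, 1, 2
history = [0, 0, 1]        # token 0 was generated twice, token 1 once, token 2 never
adjusted = apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.25)
```

Token 0 is penalized most because it appeared twice, token 1 receives a smaller penalty, and token 2 is left untouched.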

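stop (Range: string or list of strings)

Description: Specifies one or more sequences at which to stop generation. Example: When set to [".", "!", "?"], the model stops generating as soon as it produces one of these punctuation marks. Many APIs, including OpenAI’s, leave the stop sequence itself out of the returned text.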
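
Servers normally apply stop sequences during generation, but the observable effect can be imitated on a finished string. The helper below is a hypothetical client-side sketch that cuts the text at the earliest stop sequence and drops the sequence itself.

```python
def truncate_at_stop(text, stop):
    """Cut text at the earliest occurrence of any stop sequence (sequence excluded)."""
    stops = [stop] if isinstance(stop, str) else list(stop)
    cut = len(text)
    for s in stops:
        if not s:
            continue  # ignore empty strings, which would match at position 0
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

result = truncate_at_stop("Hello world. More text!", [".", "!", "?"])
```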

n (Default 1)

Description: Specifies how many completions to generate. Example: When set to 3, the model will generate 3 different responses.

best_of

Description: Generates multiple candidate results and returns the best n, where "best" typically means the highest log probability per token. Example: With n=2 and best_of=5, the model generates 5 candidate results and returns the 2 best ones.
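
The ranking step can be sketched as follows, assuming "best" means the highest average log probability per token. The candidate texts and their log probabilities are invented for the example, and `pick_best` is a hypothetical helper.

```python
def pick_best(candidates, n=1):
    """Return the n texts with the highest mean token log-probability.

    candidates is a list of (text, token_logprobs) pairs.
    """
    def mean_logprob(item):
        _, logprobs = item
        return sum(logprobs) / len(logprobs)

    ranked = sorted(candidates, key=mean_logprob, reverse=True)
    return [text for text, _ in ranked[:n]]

candidates = [
    ("A", [-0.1, -0.2]),  # mean -0.15
    ("B", [-1.0, -0.5]),  # mean -0.75
    ("C", [-0.3]),        # mean -0.30
]
top_two = pick_best(candidates, n=2)
```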

logprobs

Description: Returns the most likely tokens and their log probabilities. Example: When set to 3, each generated token will be accompanied by the 3 most likely alternative options and their probabilities.
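
What the API returns can be illustrated by computing log probabilities from raw logits and keeping the top entries. The vocabulary and logits below are made up, and `top_logprobs` is a hypothetical helper name for this sketch.

```python
import math

def top_logprobs(logits, vocab, k):
    """Return the k most likely (token, log-probability) pairs for one position."""
    peak = max(logits)
    # log of the softmax denominator, computed stably (log-sum-exp)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    scored = [(token, x - log_total) for token, x in zip(vocab, logits)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

vocab = ["cat", "dog", "fish"]
logits = [2.0, 1.0, 0.0]
top2 = top_logprobs(logits, vocab, k=2)
```

Log probabilities are always zero or negative; exponentiating one recovers the ordinary probability of that token.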

no_repeat_ngram_size (Default: 0)

Description: Prevents any n-gram (sequence of n consecutive tokens) of the specified size from appearing more than once. Example: When set to 3, the model never generates the same three-token sequence twice in one output.
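
One way to enforce this during decoding is to compute, at every step, which next tokens would complete an n-gram that already exists, and then forbid them. The sketch below works on plain Python lists with a hypothetical helper name; real libraries do the same on token ids and set the banned logits to negative infinity.

```python
def banned_next_tokens(generated, ngram_size):
    """Return tokens that would repeat an existing n-gram if generated next."""
    if ngram_size <= 0 or len(generated) + 1 < ngram_size:
        return set()
    # the last (ngram_size - 1) tokens form the start of the n-gram being built
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = generated[i:i + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned

history = ["the", "cat", "sat", "on", "the", "cat"]
banned = banned_next_tokens(history, ngram_size=3)  # "the cat sat" already exists
```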

[!CAUTION]

These parameters can be used together, but with some limitations:

  • Some parameters work against each other, like high temperature with strict top-k/top-p.
  • Beam search often overrides other sampling methods.
  • Using many complex parameters at once can slow down generation.
  • Not all models support every parameter.
  • Some parameters have specific valid ranges (e.g., temperature is usually 0-2).
  • Different tasks may need different parameter combinations.
  • Stop conditions usually take priority over other parameters.

Best practice:

  • Start with defaults
  • Adjust one parameter at a time
  • Test thoroughly
  • Keep notes on what works best

| Parameter Name | Default Value | Range | Description |
| --- | --- | --- | --- |
| Temperature | 1 | 0-2 | Controls randomness and creativity of output. Higher values increase creativity, lower values increase consistency. |
| Max Generation Length | Model-dependent | 1 to model’s maximum | Caps the number of tokens the LLM generates in its output. |
| Top-K | 0 | 1 to vocabulary size | Restricts sampling to the top K highest-probability words. Common range 40-50. |
| Top-p | 1 | 0-1 | Samples from the most likely words whose cumulative probability reaches p. Common values 0.9, 0.95. |
| Repetition Penalty | 1 | 1.0-2.0 | Reduces likelihood of generating repetitive words, decreasing redundant content. |
| Beam Search | 1 | 1-10 | Maintains multiple candidate sequences and selects the highest-scoring output. Key parameter is the Beam Width. |
| Length Penalty | 1.0 | Usually 0.0 to 2.0 | Adjusts the score of generated text based on its length, controlling output length. |
| frequency_penalty | 0 | -2.0 to 2.0 | Penalizes tokens based on their frequency. |
| presence_penalty | 0 | -2.0 to 2.0 | Penalizes tokens based on their presence. |
| stop | None | String or String list | Specifies tokens at which to stop generation. |
| n | 1 | Positive integer | Number of completions to generate. |
| best_of | 1 | Positive integer, ≥ n | Generates multiple candidates and returns the best n. |
| logprobs | null | Non-negative integer (usually 0-5) | Returns log probabilities of the most likely tokens. |
| no_repeat_ngram_size | 0 | Positive integer | Prevents repetition of n-grams of the specified length. |

Training Parameters
#

  • **Learning Rate:** Controls the step size at each iteration while moving toward a minimum of the loss function. Example: 0.0001 (1e-4)
  • **Batch Size:** The number of training examples used in one iteration. Example: 32, 64, 128
  • **Optimizer:** Algorithm used to update the model’s weights. Example: Adam, SGD, RMSprop
  • **Epochs:** The number of complete passes through the entire training dataset. Example: 10, 50, 100
  • **Weight Initialization:** Method used to set the initial random weights of the neural network. Example: Xavier initialization, He initialization
  • **Regularization:** Techniques to prevent overfitting. Example: L2 regularization (weight decay = 0.01), Dropout (rate = 0.1)
  • **Learning Rate Scheduler:** Strategy to adjust the learning rate during training. Example: StepLR (step_size=30, gamma=0.1), CosineAnnealingLR
  • **Model Architecture:** The structure and size of the neural network. Example: Transformer with 12 layers, 768 hidden size, 12 attention heads
  • **Sequence Length:** The maximum length of input sequences. Example: 512, 1024, 2048 tokens
  • **Warmup Steps:** Number of steps to gradually increase the learning rate at the start of training. Example: 1000 steps
  • **Gradient Clipping:** Technique to prevent exploding gradients by limiting their magnitude. Example: max_norm=1.0
  • **Mixed Precision Training:** Using lower precision (e.g., float16) to speed up training and reduce memory usage. Example: Enabled with float16
  • **Distributed Training Strategy:** Method for training across multiple GPUs or nodes. Example: Data Parallel, Model Parallel
  • **Attention Dropout:** Dropout rate specifically for attention layers. Example: 0.1
  • **Activation Function:** Non-linear function applied to neuron outputs. Example: ReLU, GELU

| Parameter | Explanation | Example |
| --- | --- | --- |
| Learning Rate | Controls the step size at each iteration while moving toward a minimum of the loss function. | 0.0001 (1e-4) |
| Batch Size | The number of training examples used in one iteration. | 32, 64, 128 |
| Optimizer | Algorithm used to update the model’s weights. | Adam, SGD, RMSprop |
| Epochs | The number of complete passes through the entire training dataset. | 10, 50, 100 |
| Weight Initialization | Method used to set the initial random weights of the neural network. | Xavier initialization, He initialization |
| Regularization | Techniques to prevent overfitting. | L2 regularization (weight decay = 0.01), Dropout (rate = 0.1) |
| Learning Rate Scheduler | Strategy to adjust the learning rate during training. | StepLR (step_size=30, gamma=0.1), CosineAnnealingLR |
| Model Architecture | The structure and size of the neural network. | Transformer with 12 layers, 768 hidden size, 12 attention heads |
| Sequence Length | The maximum length of input sequences. | 512, 1024, 2048 tokens |
| Warmup Steps | Number of steps to gradually increase the learning rate at the start of training. | 1000 steps |
| Gradient Clipping | Technique to prevent exploding gradients by limiting their magnitude. | max_norm=1.0 |
| Mixed Precision Training | Using lower precision (e.g., float16) to speed up training and reduce memory usage. | Enabled with float16 |
| Distributed Training Strategy | Method for training across multiple GPUs or nodes. | Data Parallel, Model Parallel |
| Attention Dropout | Dropout rate specifically for attention layers. | 0.1 |
| Activation Function | Non-linear function applied to neuron outputs. | ReLU, GELU |
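
Two of the entries above, warmup steps and gradient clipping, are simple enough to sketch in plain Python. The combination of linear warmup with cosine decay, the `total_steps` value, and both function names are assumptions chosen for illustration; frameworks such as PyTorch provide their own schedulers and clipping utilities.

```python
import math

def lr_at_step(step, base_lr=1e-4, warmup_steps=1000, total_steps=10000):
    """Linear warmup to base_lr, then cosine decay to zero (an assumed schedule)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradients so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

early, peak, final = lr_at_step(0), lr_at_step(999), lr_at_step(10000)
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # original norm is 5.0
```

The learning rate climbs during the first 1000 steps, peaks at the base value, and then decays, while clipping preserves the gradient direction and only shrinks its magnitude.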