Generative AI Configurations

Large language models expose a number of parameters and configurations that can be adjusted so that their responses better meet the objectives at hand.

These parameters are generally known as inference parameters because they are applied during “inference”, also known as the prediction step.

Max New Tokens

The max new tokens parameter limits how many tokens the model can generate in its response. In practice the model usually tries to end its generation at a sentence boundary, but a strict token cap can still cut the output off mid-sentence.
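A minimal sketch of how a max new tokens cap works, using a toy vocabulary and a stand-in "model" (all names here are illustrative, not a real LLM API):

```python
import random

# Toy vocabulary; in a real LLM the next token would come from a forward pass.
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "."]

def generate(prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):        # hard cap on newly generated tokens
        next_token = random.choice(VOCAB)  # stand-in for model sampling
        tokens.append(next_token)
        if next_token == ".":              # models often try to stop at a sentence end
            break
    return tokens

out = generate(["the", "cat"], max_new_tokens=5)
print(out)  # at most 5 tokens are appended to the prompt
```

The loop either hits the token cap or stops early at a sentence-ending token, which mirrors the behaviour described above.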

Next Token Selection Approaches

Next token selection deals with the approach by which the next word/token is chosen. There are two main approaches:

  • Greedy

  • Random Weight Sampling

Greedy

With greedy selection, the model always picks the next word/token with the highest probability. For example, in completing “Dotun is a …”, “boy” would always be selected if it is the highest-probability word after “a” to describe a person. This can be helpful in situations where we want matter-of-fact answers.
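Greedy decoding can be sketched as a simple argmax over the next-token distribution (the probabilities below are made up for illustration):

```python
# Hypothetical next-token probabilities after the prefix "Dotun is a".
next_token_probs = {"boy": 0.62, "man": 0.21, "girl": 0.12, "dog": 0.05}

# Greedy selection: always take the single highest-probability token.
greedy_choice = max(next_token_probs, key=next_token_probs.get)
print(greedy_choice)  # → boy
```

Because the argmax is deterministic, the same prompt yields the same continuation every time.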

Random (weight) sampling

Random sampling introduces a degree of randomness into the output. There are two parameters that can be configured in this case.

  • Top K: The next token is chosen from a limited set of the “k” most probable token options. The idea is that we narrow the options the LLM samples from down to a small number of candidates.

  • Top P: The next token is sampled from the smallest set of top tokens whose cumulative probability reaches the threshold “p”.
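The two filtering schemes above can be sketched over a made-up next-token distribution; after filtering, the model samples from the surviving tokens (values and names are illustrative):

```python
# Illustrative next-token distribution.
probs = {"boy": 0.50, "man": 0.25, "girl": 0.15, "dog": 0.07, "car": 0.03}

def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

def top_p_filter(probs, p):
    """Keep the smallest top-ranked set whose cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return kept

print(list(top_k_filter(probs, k=2)))    # → ['boy', 'man']
print(list(top_p_filter(probs, p=0.8)))  # → ['boy', 'man', 'girl']
```

With k=2 only the two most likely tokens remain; with p=0.8 tokens are added in probability order until their cumulative mass reaches 0.8, so top-p adapts the candidate-set size to how concentrated the distribution is.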

Temperature

The temperature parameter determines the level of randomness when selecting the next word. Sometimes we want the LLM output to be more creative rather than factual; for example, asking an LLM to perform a creative task such as writing a poem or a love song calls for creativity more than strictly factual responses. The higher the temperature, the more creative (random) the output; the lower the temperature, the more factual (deterministic) it tends to be.
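Under the hood, temperature is commonly applied by dividing the model's logits by T before the softmax; a small sketch with made-up logits shows how higher T flattens the distribution:

```python
import math

# Illustrative (made-up) logits for a few candidate next tokens.
logits = {"boy": 4.0, "man": 2.0, "girl": 1.0}

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize with a softmax."""
    scaled = {tok: v / temperature for tok, v in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    return {tok: math.exp(v) / z for tok, v in scaled.items()}

cold = softmax_with_temperature(logits, temperature=0.5)
hot = softmax_with_temperature(logits, temperature=2.0)
print(round(cold["boy"], 3))  # near 1: the top token almost always wins
print(round(hot["boy"], 3))   # smaller: probability mass is spread out
```

At low temperature the top token dominates (output close to greedy/factual); at high temperature the alternatives gain probability, so sampling produces more varied, "creative" continuations.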