Generative AI Configurations

When large language models generate responses, there are certain parameters and configurations that can be adjusted so that the responses from the models meet the objectives at hand.

These parameters are generally known as inference parameters, as they are used during “inference”, also known as the prediction step.

Max New Tokens

Max new tokens limits the number of tokens the model can generate in a response. Usually, the output tries to cut off the generation at the end of a sentence.
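A minimal sketch of setting this limit, assuming the Hugging Face transformers library and the small gpt2 model (the parameter name max_new_tokens is that library's; neither the library nor the model is prescribed by these notes):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed setup: a small, public decoder-only model from the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Dotun is a", return_tensors="pt")

# max_new_tokens caps how many tokens are generated on top of the prompt.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```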

Next Token Selection Approaches

Next token selection deals with the approach by which the next word/token is selected. There are two main approaches:

  • Greedy

  • Random (weighted) sampling

Greedy

With greedy selection, the idea is to always select the next word/token with the highest probability. For example, in “Dotun is a boy”, “boy” would always be selected if it is the highest-probability word after “a” to describe a person. This can be helpful in situations where we want matter-of-fact answers.
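A tiny sketch of the idea, with made-up probabilities for illustration:

```python
# Hypothetical next-token probabilities after the prefix "Dotun is a"
next_token_probs = {"boy": 0.62, "man": 0.20, "girl": 0.10, "car": 0.08}

# Greedy selection: always pick the single highest-probability token.
greedy_choice = max(next_token_probs, key=next_token_probs.get)
print(greedy_choice)  # -> "boy", every single time
```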

Random (weighted) sampling

With random sampling, a bit of randomness is allowed into the output: the next token is chosen at random, weighted by its probability. There are two parameters that can be configured in this case (a sketch of both follows the list below):

  • Top K: In this case, the next word is chosen from a limited set of the “k” most probable token options. The idea is that we narrow the options the LLM samples from to a small number of high-probability tokens.

  • Top P: In this case, the next word is sampled from the smallest set of top tokens whose cumulative probability reaches the threshold “p”.
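A rough sketch of both ideas (plain NumPy, not any particular library's implementation; the tokens and probabilities are made up):

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical next-token candidates, already sorted by probability (descending).
tokens = np.array(["boy", "man", "girl", "car", "tree"])
probs = np.array([0.50, 0.25, 0.15, 0.07, 0.03])

def sample_top_k(tokens, probs, k=3):
    # Keep only the k most probable tokens, renormalize, then sample.
    kept = probs[:k] / probs[:k].sum()
    return rng.choice(tokens[:k], p=kept)

def sample_top_p(tokens, probs, p=0.9):
    # Keep the smallest set of top tokens whose cumulative probability reaches p.
    cutoff = int(np.searchsorted(np.cumsum(probs), p)) + 1
    kept = probs[:cutoff] / probs[:cutoff].sum()
    return rng.choice(tokens[:cutoff], p=kept)

print(sample_top_k(tokens, probs, k=3))    # sampled from {boy, man, girl}
print(sample_top_p(tokens, probs, p=0.9))  # sampled from {boy, man, girl}: 0.50 + 0.25 + 0.15 = 0.90
```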

Temperature

The temperature parameter determines the level of randomness when selecting the next word. Sometimes we might want the LLM output to be a bit more creative rather than factual. For example, asking an LLM to perform a creative task such as writing a poem or a love song might not need factual responses so much as creativity. The higher the temperature, the more creative the output; the lower the temperature, the more factual. Under the hood, the temperature rescales the probability distribution over the next token: a high temperature flattens the distribution (more randomness), while a low temperature sharpens it around the most likely tokens.
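A sketch of that rescaling with made-up raw scores (logits): the logits are divided by the temperature before the softmax.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical raw scores for four candidate next tokens.
logits = np.array([4.0, 2.5, 1.0, 0.5])

for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits / temperature)
    print(temperature, np.round(probs, 3))

# Low temperature  -> probability piles onto the top token (more deterministic/factual).
# High temperature -> the distribution flattens out (more random/creative).
```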

Generative AI Introduction

To understand generative AI, it is imperative to understand the technology powering it. In short, that technology is Large Language Models (LLMs), which in simple terms means that computers have been shown a lot of examples of how language is written and are then tested on guessing how they would write theirs.

Use Cases

LLMs are used in sentiment analysis, translation, summarization, and content creation for text, images and audio.

How they work

LLMs take instructions called Prompts, which are passed through the Model, and the model offers a Completion. For example, you can have the following steps:

(user) Prompt: “What is the capital of the USA?”

Model (Offers Completion): “Washington, DC”
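As a quick sketch of this prompt-to-completion flow (assuming the Hugging Face pipeline API and the small gpt2 model; the exact completion will vary by model, and a small base model will not answer as reliably as a chat model):

```python
from transformers import pipeline

# Assumed model: gpt2 is small and public; instruction-tuned models answer questions more reliably.
generator = pipeline("text-generation", model="gpt2")

prompt = "What is the capital of the USA?"
completion = generator(prompt, max_new_tokens=10)
print(completion[0]["generated_text"])
```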

History of LLMs

The mathematical idea behind LLMs comes from a paper by Google called “Attention Is All You Need”, which proposed the self-attention-based transformer architecture. The architecture is composed of two main parts: the encoder and the decoder.

Encoders

The encoder is basically in charge of taking an input (a sentence), tokenizing it (converting each word to a number) and converting the tokens to embeddings (embeddings are large groups of numeric values, e.g. 512 numbers or more, that capture the underlying meaning behind the word/sentence).

Encoders can be used by themselves for tasks such as sentiment analysis, replacing a word in a sentence, etc.
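A minimal sketch of the encoder path, assuming the Hugging Face transformers library and the bert-base-uncased encoder (whose embeddings happen to be 768 numbers per token):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize: each word/subword becomes a number (a token id).
inputs = tokenizer("Dotun is a boy", return_tensors="pt")
print(inputs["input_ids"])

# Encode: each token id becomes an embedding that captures its meaning in context.
outputs = encoder(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, number_of_tokens, 768)
```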

Decoders

Decoders are used to generate new information from embeddings. The idea behind decoders is quite simple in the sense that they are the generating arm of the transformer architecture. Decoders can be used for summarizing a text, answering questions and other forms of generating new textual information. Tools like ChatGPT are primarily decoder-based.

Encoders and Decoders

Encoders and decoders together are used in what are described as sequence-to-sequence tasks, such as translating between languages.
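A small sketch of a full encoder-decoder model doing translation, assuming the Hugging Face pipeline API and the t5-small checkpoint (a sequence-to-sequence transformer):

```python
from transformers import pipeline

# t5-small contains both an encoder and a decoder.
translator = pipeline("translation_en_to_fr", model="t5-small")

result = translator("The capital of the USA is Washington, DC.")
print(result[0]["translation_text"])
```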

Prompt Engineering (How to make LLMs work better)

To make LLMs respond better, techniques to guide the AI model have been proposed. This is where prompt engineering comes in. Prompt engineering uses the technique of showing the model examples of how a question should be answered, called In-Context Learning. There are 3 forms of in-context learning (illustrated after the list below):

  • Zero-shot inference: no examples are given to the model

  • One-shot inference: one example is given to the model while asking the question.

  • Few-shot inference: a few examples are given to the model while the question is being asked.
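To make the three forms concrete, here is a sketch of what each prompt might look like for a sentiment-classification question (the review texts are made up for illustration):

```python
# Zero-shot: just the task, no examples.
zero_shot = """Classify this review: "I loved this movie!"
Sentiment:"""

# One-shot: one worked example before the real question.
one_shot = """Classify this review: "The plot was dull and slow."
Sentiment: Negative

Classify this review: "I loved this movie!"
Sentiment:"""

# Few-shot: several worked examples before the real question.
few_shot = """Classify this review: "The plot was dull and slow."
Sentiment: Negative

Classify this review: "A beautiful, moving story."
Sentiment: Positive

Classify this review: "I loved this movie!"
Sentiment:"""
```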