Glossary
- autoregressive models - models that use their own previous outputs as inputs for future predictions. An LLM generates text one token at a time, feeding each newly generated token back into the input before predicting the next one, which is what makes it autoregressive (see the sketch below).
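A minimal sketch of autoregressive (greedy) decoding, assuming a hypothetical `model` that maps a (batch, seq_len) tensor of token IDs to logits of shape (batch, seq_len, vocab_size):

```python
import torch

def generate_greedy(model, token_ids, max_new_tokens):
    # token_ids: (batch, seq_len) tensor of token IDs
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(token_ids)                             # (batch, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        token_ids = torch.cat([token_ids, next_id], dim=1)        # feed the output back in as input
    return token_ids
```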
- autograd - “provides functions to automatically compute gradients in dynamic computational graphs”
- byte pair encoding - a way of building an entire vocabulary out of small chunks of words: individual characters first, then common combinations of characters merged into larger sub-word units. This lets the tokenizer encode even made-up words without having to fall back on a special unknown-word token (see the sketch below).
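A quick illustration using the tiktoken library’s GPT-2 byte pair encoder (the made-up word is arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # the BPE tokenizer used by GPT-2
ids = enc.encode("Akwirw ier")        # a made-up word still gets encoded
print(ids)                            # sub-word token IDs, no unknown-word token needed
print(enc.decode(ids))                # round-trips back to "Akwirw ier"
```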
- chain rule - “chain rule is a way to compute gradients of a loss function with respect to the model’s parameters in a computation graph. This provides the information needed to update each parameter in a way that minimizes the loss function, which serves as a proxy for measuring the model’s performance, using a method such as gradient descent.” (Raschka, appendix A.4)
- computational graph - “a directed graph that allows us to express and visualize mathematical expressions … lays out the sequence of calculations needed to compute the output of a neural network.” Laying out this graph is what makes it possible to compute the required gradients for backpropagation, the main training algorithm for neural networks. Per ChatGPT: “Computation graphs are like roadmaps that show how calculations flow from inputs to outputs in a model. The computation graph visually represents the flow of calculations in a machine learning model, including the forward pass in logistic regression.”
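A minimal sketch of a computation graph in PyTorch, a single-feature logistic regression, where autograd applies the chain rule backward through the graph (the input values are arbitrary placeholders):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1.1])                      # input feature
y = torch.tensor([1.0])                      # true label
w = torch.tensor([2.2], requires_grad=True)  # weight (a leaf node in the graph)
b = torch.tensor([0.0], requires_grad=True)  # bias

z = x * w + b                                # net input
a = torch.sigmoid(z)                         # activation / predicted probability
loss = F.binary_cross_entropy(a, y)          # loss node at the end of the graph

loss.backward()                              # backpropagation via the chain rule
print(w.grad, b.grad)                        # gradients used by gradient descent
```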
- data sampling pipeline - the steps that turn raw text into training batches for the LLM: tokenize the text, map tokens to IDs, slide a window over the IDs to build input-target pairs, and batch those pairs with a data loader (see the sketch below).
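A sketch assuming token IDs already exist; the `SlidingWindowDataset` class name, the placeholder IDs, and the window/batch sizes are all made up for illustration:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SlidingWindowDataset(Dataset):
    def __init__(self, token_ids, context_size):
        self.inputs, self.targets = [], []
        for i in range(len(token_ids) - context_size):
            self.inputs.append(torch.tensor(token_ids[i : i + context_size]))
            self.targets.append(torch.tensor(token_ids[i + 1 : i + context_size + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

token_ids = list(range(50))                      # placeholder token IDs
dataset = SlidingWindowDataset(token_ids, context_size=4)
loader = DataLoader(dataset, batch_size=8, shuffle=True)

x, y = next(iter(loader))
print(x.shape, y.shape)                          # torch.Size([8, 4]) for both
```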
- embedding - converting data into a vector format—a numerical representation that a computer can use to do things with non-numerical, categorical data like text. Different data formats, like text, image, and music, require different embedding models. An embedding is a mapping from discrete objects to points in a continuous vector space, so each token becomes a specific point in that space.
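A small sketch of a token-embedding layer in PyTorch; the vocabulary size, embedding dimension, and token IDs are made up:

```python
import torch

torch.manual_seed(123)
embedding = torch.nn.Embedding(num_embeddings=6, embedding_dim=3)  # 6-word vocab, 3-d vectors

token_ids = torch.tensor([2, 3, 5, 1])
print(embedding(token_ids))        # one 3-dimensional vector (a point in the space) per token
print(embedding.weight.shape)      # torch.Size([6, 3]): the lookup table itself
```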
- emergent behaviors - LLMs are powerful because they exhibit emergent behaviors like the capacity for translation—behaviors the LLMs haven’t been specifically trained to do.
- input-target pairs - the training examples used for next-word prediction: the input is a chunk of token IDs and the target is the same chunk shifted one position to the right, so each position’s target is the token that follows it (see the sketch below).
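A minimal sketch of building input-target pairs with a sliding window; the token IDs and context size are placeholders:

```python
token_ids = [40, 367, 2885, 1464, 1807, 3619, 402, 271]
context_size = 4

for i in range(len(token_ids) - context_size):
    inputs = token_ids[i : i + context_size]
    targets = token_ids[i + 1 : i + context_size + 1]   # the same chunk shifted by one
    print(inputs, "->", targets)
```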
- PyTorch - PyTorch has three components (see the sketch after this list):
- tensor library - the core data structure for computation, which also lets you use GPU acceleration for faster computations
- automatic differentiation engine - “autograd” - “provides functions to automatically compute gradients in dynamic computational graphs”
- neural network module - simplifies the implementation of neural networks. ChatGPT: “This module includes pre-defined layers (such as linear layers, convolutional layers, activation functions, etc.) and utilities for building and training neural networks. It allows you to define your neural network architecture as a sequence of layers and provides tools for managing parameters, computing losses, and optimizing models during training.”
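A quick sketch touching all three components; the tensor shapes are arbitrary:

```python
import torch

# 1. Tensor library: create a tensor and move it to the GPU if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(2, 3).to(device)

# 2. Automatic differentiation engine: track operations and compute gradients.
w = torch.ones(3, 1, requires_grad=True, device=device)
loss = (x @ w).sum()
loss.backward()                 # gradients land in w.grad

# 3. Neural network module: pre-defined layers with managed parameters.
layer = torch.nn.Linear(in_features=3, out_features=1).to(device)
print(layer(x).shape)           # torch.Size([2, 1])
```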
- logistic regression forward pass - Per ChatGPT: “Logistic regression is a type of algorithm used for binary classification, which means it predicts one of two outcomes (like yes/no, true/false). The goal is to find a line (or boundary) that separates the two classes based on the input data.” Logistic regression algorithms have four elements (see the sketch after this list):
- inputs / features: the elements of data you have
- weights / parameters: how important each feature is
- bias / intercept: a baseline prediction that the model starts with
- prediction: multiply each feature by its weight, sum the results, add the bias, then apply a sigmoid function, which gives you a probability score between 0 and 1.
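The four elements above as a PyTorch sketch; the feature values, weights, and bias are arbitrary placeholders:

```python
import torch

x = torch.tensor([1.5, -0.7, 2.0])     # inputs / features
w = torch.tensor([0.4, 0.9, -0.2])     # weights: how important each feature is
b = torch.tensor(0.1)                  # bias / intercept: the baseline prediction

z = torch.dot(x, w) + b                # weighted sum of the features plus the bias
prob = torch.sigmoid(z)                # probability score between 0 and 1
print(prob)                            # above 0.5 predicts one class, below 0.5 the other
```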
- RAG - retrieval-augmented generation. A technique that uses an embedding model (one that embeds entire sentences or paragraphs) to retrieve specific, detailed information from an external source and supply it to the LLM when it generates text.
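A bare-bones sketch of just the retrieval step, with random vectors standing in for real sentence embeddings; an actual pipeline would use a sentence-embedding model and pass the retrieved passage to the LLM as extra context:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)
documents = ["Notes on LLM pretraining.", "Notes on gardening.", "Notes on attention."]
doc_vecs = torch.rand(len(documents), 8)          # stand-in sentence embeddings
query_vec = doc_vecs[2] + 0.01 * torch.rand(8)    # a query vector close to the third document

sims = F.cosine_similarity(query_vec.unsqueeze(0), doc_vecs)  # one similarity score per document
best = int(sims.argmax())
print(documents[best])   # the retrieved passage would be prepended to the LLM's prompt
```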
- sigmoid function - a function (built into PyTorch as torch.sigmoid, in this case) that squashes any real-valued input into the range 0 to 1, so the output can be read as a probability.
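A quick check that the sigmoid squashes arbitrary real numbers into the range 0 to 1:

```python
import torch

print(torch.sigmoid(torch.tensor([-4.0, 0.0, 4.0])))   # roughly [0.018, 0.500, 0.982]
```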
- special context tokens - tokens that increase a model’s understanding of context or other relevant info in the text—markers for unknown words, document boundaries, etc. Common special context tokens include (see the sketch after this list):
- end of text
- unknown words
- beginning of sequence
- end of sequence (basically, same as end of text tokens)
- padding
Note that GPT tokenizers don’t use tokens for words outside the model’s vocabulary. Instead, they use byte pair encoding—breaking down individual words into sub-word units.
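A sketch of appending special context tokens to a toy vocabulary; the base vocabulary is made up, and the `<|...|>` strings follow the common convention:

```python
vocab = {"hello": 0, "world": 1, ",": 2, ".": 3}

for special in ("<|endoftext|>", "<|unk|>"):
    vocab[special] = len(vocab)     # append special tokens after the regular words

print(vocab)
# A simple tokenizer would map any out-of-vocabulary word to vocab["<|unk|>"]
# and insert <|endoftext|> between unrelated documents.
```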
- tensor - even after you have your encoding/decoding tokenizer, you still need to represent tokens as continuous-valued vectors, and tensors are the data structure that holds them. Sections 2.6 and A2.2 explain these things. Tensors are mathematical entities that “generalize vectors and matrices to potentially higher dimensions” (see the sketch after this list).
- A scalar (just a single number) is a tensor of rank 0.
- A vector (single row or column of numbers) is a tensor of rank 1.
- A 2d matrix is a tensor of rank 2.
Tensors are multi-dimensional data containers, with each dimension representing a different feature.
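Tensors of increasing rank in PyTorch (the values are arbitrary):

```python
import torch

scalar = torch.tensor(5.0)                        # rank 0: a single number
vector = torch.tensor([1.0, 2.0, 3.0])            # rank 1: a single row of numbers
matrix = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # rank 2: a 2d matrix
cube = torch.rand(2, 3, 4)                        # rank 3: a stack of matrices

for t in (scalar, vector, matrix, cube):
    print(t.ndim, tuple(t.shape))                 # rank (number of dimensions) and shape
```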
- tokenization - the process of splitting text into individual tokens, which is a required preprocessing step for creating embeddings for an LLM
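A minimal regex-based tokenizer sketch that splits on punctuation and whitespace; the sample sentence is made up:

```python
import re

text = "Hello, world. Is this a test?"
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)   # keep the punctuation as tokens
tokens = [t.strip() for t in tokens if t.strip()]    # drop empty strings and whitespace
print(tokens)   # ['Hello', ',', 'world', '.', 'Is', 'this', 'a', 'test', '?']
```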
- transformer architecture - what makes the transformer architecture so effective is the attention mechanism, which lets the model selectively focus on different parts of the whole input sequence while it generates text one word at a time (see the sketch after this list). The initial transformers were used for machine translation.
- original transformers had both an encoder for parsing text and a decoder for generating text. GPT only uses the decoder portion, which keeps the model simpler.
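A bare-bones sketch of the scaled dot-product attention at the heart of the transformer; the sequence length, embedding dimension, and projection matrices are placeholders (in a real model the projections are learned, and GPT additionally masks future positions):

```python
import torch

torch.manual_seed(123)
seq_len, d = 4, 8                       # 4 tokens, 8-dimensional embeddings
x = torch.rand(seq_len, d)              # token embeddings for one sequence

W_q, W_k, W_v = (torch.rand(d, d) for _ in range(3))   # query/key/value projections
queries, keys, values = x @ W_q, x @ W_k, x @ W_v

scores = queries @ keys.T / d ** 0.5    # how much each token attends to every other token
weights = torch.softmax(scores, dim=-1) # attention weights sum to 1 per token
context = weights @ values              # weighted mix over the whole input sequence
print(context.shape)                    # torch.Size([4, 8])
```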