2025-02-01
“In three words: deep learning worked.”
“In 15 words: deep learning worked, got predictably better with scale, and we dedicated increasing resources to it.”
“That’s really it; humanity discovered an algorithm that could really, truly learn any distribution of data (or really, the underlying “rules” that produce any distribution of data). To a shocking degree of precision, the more compute and data available, the better it gets at helping people solve hard problems. I find that no matter how much time I spend thinking about this, I can never really internalize how consequential it is.”
- Sam Altman, [The Intelligence Age, 23-09-2024]
“Artificial Intelligence that has learned to create data such as: images, text, audio, videos, etc.”
- Text (Large Language Models)
- Images (and Video) (Diffusion Models)
- Audio (Text-to-Speech)
- Tabular Data (Synthetic data)
Typically meant in the context of content creation
This is not new: thispersondoesnotexist.com (2018)
But it became significantly more powerful, flexible, practical, and “mainstream” around 2022
\[P(token_n|token_{n-1}, \cdots, token_1)\]
A token: a single character, a combination of characters, or a word
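As an illustration (not part of the slides), a minimal sketch of querying this next-token distribution; it assumes the Hugging Face transformers library and the small GPT-2 base model, neither of which the slides specify:

```python
# Sketch: inspect P(token_n | token_{n-1}, ..., token_1) with a small open model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = "A dry"
input_ids = tokenizer(context, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits          # shape: (1, n_tokens, vocab_size)

# The last position holds the distribution over the *next* token.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r} -> {p.item():.3f}")
```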
\[P(token_n|token_{n-1}, \cdots, token_1)\]
This is nothing new; your phone does something similar:
\[ \begin{array}{c@{\hspace{0.3cm}}c@{\hspace{0.3cm}}c@{\hspace{0.3cm}}c@{\hspace{0.3cm}}c} \text{A} & \text{dry} & \text{well!} & \text{Well} & \text{done!} \\ \begin{pmatrix} 1\\ 0\\ 0\\ 0 \end{pmatrix} & \begin{pmatrix} 0\\ 1\\ 0\\ 0 \end{pmatrix} & \begin{pmatrix} 0\\ 0\\ 1\\ 0 \end{pmatrix} & \begin{pmatrix} 0\\ 0\\ 1\\ 0 \end{pmatrix} & \begin{pmatrix} 0\\ 0\\ 0\\ 1 \end{pmatrix} \end{array} \]
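A minimal sketch of the one-hot idea above, assuming a toy tokenizer that lowercases and strips punctuation so that “well!” and “Well” collapse to the same token, exactly as in the matrix:

```python
import numpy as np

# Toy sentence from the slide: "A dry well! Well done!"
sentence = ["A", "dry", "well!", "Well", "done!"]
normalize = lambda w: w.lower().strip("!")    # "well!" and "Well" -> "well"

vocab = []
for word in sentence:
    tok = normalize(word)
    if tok not in vocab:
        vocab.append(tok)                     # vocab = ['a', 'dry', 'well', 'done']

def one_hot(word):
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab.index(normalize(word))] = 1
    return vec

for word in sentence:
    print(f"{word:>6} -> {one_hot(word)}")
# 'well!' and 'Well' get the identical vector [0 0 1 0],
# even though they mean different things in this sentence.
```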
\[ \begin{array}{c@{\hspace{0.3cm}}c@{\hspace{0.3cm}}c@{\hspace{0.3cm}}c@{\hspace{0.3cm}}c} \text{A} & \text{dry} & \text{well!} & \text{Well} & \text{done!} \\ \begin{pmatrix} \phantom{-}0.33\\ -0.51\\ \phantom{-}0.83\\ \phantom{-}0.12 \end{pmatrix} & \begin{pmatrix} \phantom{-}0.97\\ -0.15\\ -0.11\\ \phantom{-}0.85 \end{pmatrix} & \begin{pmatrix} \phantom{-}0.94\\ \phantom{-}0.79\\ -0.34\\ \phantom{-}0.35 \end{pmatrix} & \begin{pmatrix} \phantom{-}0.94\\ \phantom{-}0.79\\ -0.34\\ \phantom{-}0.35 \end{pmatrix} & \begin{pmatrix} -0.02\\ \phantom{-}0.69\\ \phantom{-}0.54\\ -0.12 \end{pmatrix} \end{array} \]
King - Man + Woman =
Queen
\[ \begin{array}{c@{\hspace{0.3cm}}c@{\hspace{0.3cm}}c@{\hspace{0.3cm}}c@{\hspace{0.3cm}}c} \text{A} & \text{dry} & \text{well!} & \text{Well} & \text{done!} \\ \begin{pmatrix} \phantom{-}0.33\\ -0.51\\ \phantom{-}0.83\\ \phantom{-}0.12 \end{pmatrix} & \begin{pmatrix} \phantom{-}0.97\\ -0.15\\ -0.11\\ \phantom{-}0.85 \end{pmatrix} & \color{red}{\begin{pmatrix} \phantom{-}0.94\\ \phantom{-}0.79\\ -0.34\\ \phantom{-}0.35 \end{pmatrix}} & \color{red}{\begin{pmatrix} \phantom{-}0.94\\ \phantom{-}0.79\\ -0.34\\ \phantom{-}0.35 \end{pmatrix}} & \begin{pmatrix} -0.02\\ \phantom{-}0.69\\ \phantom{-}0.54\\ -0.12 \end{pmatrix} \end{array} \]
King - Man + Woman = Queen
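A sketch of the King - Man + Woman ≈ Queen arithmetic as nearest-neighbour search over pretrained word vectors; it assumes the gensim package and its downloadable GloVe vectors, which are not mentioned on the slides:

```python
import gensim.downloader as api

# 50-dimensional GloVe word vectors (downloaded on first use).
vectors = api.load("glove-wiki-gigaword-50")

# most_similar adds the "positive" vectors, subtracts the "negative" ones,
# and returns the closest words by cosine similarity.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)   # 'queen' is typically the top hit
```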
\[ \begin{array}{c@{\hspace{0.3cm}}c@{\hspace{0.3cm}}c@{\hspace{0.3cm}}c@{\hspace{0.3cm}}c} \text{A} & \text{dry} & \text{well!} & \text{Well} & \text{done!} \\ \begin{pmatrix} \phantom{-}0.33\\ -0.51\\ \phantom{-}0.83\\ \phantom{-}0.12 \end{pmatrix} & \begin{pmatrix} \phantom{-}0.97\\ -0.15\\ -0.11\\ \phantom{-}0.75 \end{pmatrix} & \begin{pmatrix} \phantom{-}0.54\\ -0.79\\ -0.34\\ \phantom{-}0.22 \end{pmatrix} & \begin{pmatrix} -0.41\\ \phantom{-}0.79\\ \phantom{-}0.17\\ \phantom{-}0.84 \end{pmatrix} & \begin{pmatrix} -0.02\\ \phantom{-}0.69\\ \phantom{-}0.54\\ -0.12 \end{pmatrix} \end{array} \]
\[
\mathrm{Attention} \sim \mathrm{Query} \cdot \mathrm{Key}^{T}
\]
\[
\mathrm{Output\ embedding} \sim \mathrm{Softmax}(\mathrm{Query} \cdot \mathrm{Key}^{T})\,\mathrm{Value}
\]
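A minimal NumPy sketch of these two formulas (with the usual \(1/\sqrt{d}\) scaling added, which the \(\sim\) glosses over). It uses random toy matrices, and shows how attention plus position information turns the two identical “well” embeddings from the earlier matrix into different, context-dependent output embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 5 tokens ("A dry well! Well done!") with 4-dimensional embeddings.
# "well!" and "Well" share the same token embedding, exactly as on the slide.
n_tokens, d = 5, 4
tok = rng.normal(size=(n_tokens, d))
tok[3] = tok[2]                          # same token -> same embedding

# Positions are what let attention treat the two "well"s differently.
pos = rng.normal(size=(n_tokens, d)) * 0.1
X = tok + pos

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
weights = softmax(Q @ K.T / np.sqrt(d))  # Attention ~ Query . Key^T
out = weights @ V                        # Output embedding ~ Softmax(Q K^T) V

# The two "well" rows now differ: each output mixes information from the whole sentence.
print(np.allclose(out[2], out[3]))       # False
```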
Optimal model size grows smoothly with the loss target and compute budget\(^{1}\)
For optimally compute-efficient training, most of the increase should go towards increased model size\(^{1}\)
Models just kept on growing (credit: Julien Simon, Hugging Face)
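Purely as an illustration of the “smooth” scaling claim, a sketch of a power law of the form \(L(N) = (N_c/N)^{\alpha}\) as used in the scaling-law literature; the constants below are placeholder values for illustration, not fitted results:

```python
import numpy as np

# Placeholder constants, roughly the order of magnitude used in published fits;
# they are assumptions for this sketch, not values from the slides.
N_c, alpha = 8.8e13, 0.076

def loss(n_params):
    """Predicted loss as a smooth power-law function of parameter count."""
    return (N_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1e11, 1e12]:
    print(f"{n:.0e} params -> predicted loss {loss(n):.2f}")
# Loss falls smoothly as the model grows; there is no sharp plateau.
```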
INPUT:
Explain what a Large Language Model is.
OUTPUT:
Explain what a transformer model is.
Explain what a tokenizer is.
Explain what gradient descent is.
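The continuation above is typical of a base model: it keeps producing plausible text instead of answering. A sketch of the same behaviour, assuming the Hugging Face transformers library and the small GPT-2 base model (not the model used in the slides):

```python
from transformers import pipeline

# A plain (non-instruction-tuned) base model.
generator = pipeline("text-generation", model="gpt2")

prompt = "Explain what a Large Language Model is."
out = generator(prompt, max_new_tokens=40, do_sample=True)[0]["generated_text"]
print(out)
# Typical continuations look like more exam-style questions or loosely related text,
# not a direct answer; instruction tuning is what changes this behaviour.
```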
INPUT:
Explain what a Large Language Model is.
HUMAN EXAMPLE:
A Large Language Model is a foundational language model with typically billions of parameters. These models have become popular in recent years because of their ease of use combined with impressive performance across the board on NLP tasks.
Foundational language model that does not “think” in ‘multiple directions’ like we do!

Thinking
It only has a fast mode, no thinking fast and slow.

Be Careful
| Feature/Aspect | GPT-3.5 | GPT-4(o) |
|---|---|---|
| Problem-Solving | Standard problem-solving skills | Greater accuracy and broader knowledge |
| Creativity | Standard | More creative and collaborative |
| Reasoning Capabilities | Standard reasoning | Advanced reasoning capabilities |
| Context Length | 8K tokens | 32K to 128K tokens |
| Multimodality | None | Vision, speech in, speech out, image generation |
| Extra Features | None | RAG, Advanced Data Analysis |
| Factual Accuracy | Standard | More likely to produce factual responses |
| Command Adherence | Standard | More likely to adhere to the prompt |
LLMs and ChatGPT