Table of Contents

My best timeline of GPT efforts is listed here: https://lspace.swyx.io/p/open-source-ai

Big list of Text Models

Datasets

see [[DATASETS]]

Language Models

Model Evolution/Family trees

Direct GPT alternatives

  • Eleuther's GPT-J-6B, GPT-NeoX
  • Google
    • PaLM 540B
      • https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/
      • FLAN-T5
      • LaMDA
        • https://twitter.com/debarghya_das/status/1638356068870033408
        • LaMDA [Feb 2022] is a 137B-param model trained on 2.81T tokens on 1024 v3 TPUs for 57.7 days. At $3.22/hr (v4 price), that costs ~$4.5M (a back-of-the-envelope check is sketched after this list).
        • RLAIF: trains its own model to predict scores for candidate responses based on labeled data on four attributes: sensibleness, specificity, interestingness, and safety.
        • Accuracy: uses both retrieval-augmented generation and Toolformer-style techniques. It maintains a toolset (Search, Calculator, ...) and performs a "Research" task of predicting a toolset and a query for a response. It loops over this response/research phase up to 4 times (see the loop sketch after this list).
        • What we know: smaller than GPT-3 (137B vs 175B params) and older (early 2022); trained on far more tokens than most (2.8T vs 1.4T); does not rely on the model to memorize facts, instead using tools like Search; costs ~$1-4M to train.
  • Yandex YaLM 100B https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6
    • It took us 65 days to train the model on a pool of 800 A100 graphics cards and 1.7 TB of online texts, books, and countless other sources.
  • BLOOM 176B
  • Tsinghua GLM-130B
    • outperforms OpenAI's GPT-3 175B and Google's PaLM 540B on critical benchmarks, AND it's open-sourced, which means you can run this model on your own machine, for free.
    • only trained on 400B tokens (compared to ~1.4T tokens for Chinchilla's 70B parameters)
    • https://twitter.com/AndyChenML/status/1611529311390949376?s=20
  • Meta
  • Databricks Dolly
    • 1.0 https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html
      • A critical step in the creation of Dolly 1.0, or any instruction-following LLM, is to train the model on a dataset of instruction and response pairs. Dolly 1.0 was trained for $30 using a dataset that the Stanford Alpaca team had created using the OpenAI API. That dataset contained output from ChatGPT, and as the Stanford team pointed out, the terms of service seek to prevent anyone from creating a model that competes with OpenAI. So, unfortunately, the answer to the common question ("can I use this commercially?") was, "probably not!" As far as we know, all the existing well-known instruction-following models (Alpaca, Koala, GPT4All, Vicuna) suffer from this limitation, prohibiting commercial use.
    • 2.0 https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
      • We knew from the OpenAI research paper that the original InstructGPT model was trained on a dataset consisting of 13,000 demonstrations of instruction following behavior
      • Databricks has over 5,000 employees who are very interested in LLMs. So we thought we could crowdsource among them to create an even higher quality dataset than the 40 labelers had created for OpenAI.
      • We set up a contest, where the top 20 labelers would get a big award. We also outlined 7 very specific tasks (an example of the resulting instruction/response record format is sketched after this list):
        • Open Q&A: For instance, “Why do people like comedy movies?” or “What is the capital of France?” In some cases, there’s not a correct answer, and in others, it requires drawing on knowledge of the world at large.
        • Closed Q&A: These are questions that can be answered using only the information contained in a passage of reference text. For instance, given a paragraph from Wikipedia on the atom, one might ask, “What is the ratio between protons and neutrons in the nucleus?”
        • Extract information from Wikipedia: Here an annotator would copy a paragraph from Wikipedia and extract entities or other factual information such as weights or measurements from the passage.
        • Summarize information from Wikipedia: For this, annotators provided a passage from Wikipedia and were asked to distill it to a short summary.
        • Brainstorming: This task asked for open-ended ideation and an associated list of possible options. For instance, “What are some fun activities I can do with my friends this weekend?”.
        • Classification: For this task, annotators were asked to make judgments about class membership (e.g. are the items in a list animals, minerals or vegetables) or to judge the properties of a short passage of text, such as the sentiment of a movie review.
        • Creative writing: This task would include things like writing a poem or a love letter.
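
A quick back-of-the-envelope check of the LaMDA training-cost figure quoted above; the chip count, runtime, and $/hr rate are the ones from the tweet, and the rest is plain arithmetic:

```python
# Rough LaMDA training-cost estimate, using the numbers quoted in the note above.
tpus = 1024          # v3 TPU chips
days = 57.7          # training duration
price_per_hr = 3.22  # $ per chip-hour (v4 list price used as a proxy)

chip_hours = tpus * days * 24
cost = chip_hours * price_per_hr
print(f"{chip_hours:,.0f} chip-hours -> ~${cost / 1e6:.1f}M")  # ~1,418,035 chip-hours -> ~$4.6M
```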
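And a minimal sketch of the response/research loop described in the LaMDA bullets above, assuming a caller-supplied generate() function standing in for the language model and toy search/calculator tools. None of these names come from Google; only the toolset idea and the cap of 4 iterations are taken from the note:

```python
# Hypothetical sketch of a LaMDA-style response/research loop (all names are illustrative):
# the model drafts a reply, may request a tool call whose result is appended to the context,
# and this repeats up to 4 times before the draft is returned as the final answer.
from typing import Callable, Dict

TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"<search results for {q!r}>",           # placeholder retrieval tool
    "calculator": lambda q: str(eval(q, {"__builtins__": {}})),  # toy calculator, demo only
}

def respond(user_msg: str, generate: Callable[[str], str], max_research_steps: int = 4) -> str:
    """`generate` stands in for the LM: it returns either 'TOOL: name | query' or a final reply."""
    context = user_msg
    draft = generate(context)
    for _ in range(max_research_steps):
        if not draft.startswith("TOOL:"):
            break                                  # model decided no more research is needed
        tool_name, query = (s.strip() for s in draft[len("TOOL:"):].split("|", 1))
        context += f"\n[{tool_name} -> {TOOLS[tool_name](query)}]"
        draft = generate(context)                  # revise the draft using the tool result
    return draft

# Toy usage: a fake "model" that asks the calculator once, then answers.
fake_model = iter(["TOOL: calculator | 2 + 2", "2 + 2 is 4."])
print(respond("what is 2+2?", lambda ctx: next(fake_model)))    # -> 2 + 2 is 4.
```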
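For concreteness, a single record in a crowdsourced instruction/response dataset like Dolly 2.0's might look like the sketch below. The field names follow the published databricks-dolly-15k layout as best I recall (instruction / context / response / category) and should be treated as an assumption rather than a spec:

```python
# Hypothetical Dolly-style instruction-following record (such datasets ship one JSON object per line).
# Field names assume the databricks-dolly-15k layout: instruction, context, response, category.
import json

record = {
    "instruction": "What is the ratio between protons and neutrons in the nucleus?",
    "context": "<paragraph copied from the Wikipedia article on the atom>",  # empty for open Q&A
    "response": "It varies by element; light nuclei have roughly equal numbers of protons and neutrons.",
    "category": "closed_qa",  # one of the 7 task types listed above
}
print(json.dumps(record, indent=2))
```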

LLaMa and variants

Brief history: https://agi-sphere.com/llama-models/ https://news.ycombinator.com/item?id=35736872

Other text models

https://pbs.twimg.com/media/Fknc4rkX0AMY5RF?format=jpg&name=large

GPT4

gpt4 api

gpt4 capabilities

Applications

GPT3 applications:

wiring up LLMs to python https://twitter.com/karpathy/status/1593081701454204930?s=20&t=2ra2Yfz0NFSbfJ_IGixNjA

GPT4 applications

Top GPT Prompt Engineering Reads

How GPT works

Don't call it generative

Specialized language models

GPT Products

GPT tooling

mostly from https://twitter.com/goodside/status/1588247865503010816

dealing with GPT context size

Ethical issues

Flan-T5

Misc Text AI

Karpathy analogies

  • https://twitter.com/karpathy/status/1644183721405464576
  • The analogy between the GPTs of today and the CPUs of the early days of computing is interesting. GPT is a funny kind of programmable text computer. Have to think through it more!
  • Memory
    • GPT-4 RAM is ~log2(50K vocab size) * (32K context length) / (8 bits/byte) ~= 64 kB, roughly a Commodore 64. Just as then, optimizing this precious resource is critical. GPT registers are the residual stream. There are d_model of them, e.g. GPT-3 has ~12K registers. VLIW architecture vibes. (A quick numerical check of the 64 kB figure is sketched at the end of this list.)
  • CPU
    • The LOAD instruction is the Attention mechanism, except it can address by both location and/or content.
    • The STORE instruction is forced every n_layer number of clock cycles.
    • The ALUs are the MLPs + LayerNorms.
    • Awkwardly, as their params are not shared across layers, the ALU changes at each clock cycle.
    • Optionally, the MLPs may also be interpreted as supporting a kind of fixed knowledge-database lookup. The programs always take the form [[LOAD, ALU]*N, STORE]*M, where N is n_layer and M is num_tokens.
  • Architecture: GPT feels closer to a fixed-function than a stored-program computer because the number of parameters is so large. In contrast, the description length of a CPU is very low and all the action is in the memory configuration. Another way to look at it is that GPT is a much more bloated/complex computer, which is fine because it is not engineered but optimized, and the upshot is that the programs can be shorter.
  • related: div garg version of software 3.0 https://divgarg.substack.com/p/software-3
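
A quick numerical check of the "GPT-4 RAM" figure from the Memory bullet above; the ~50K vocab and 32K context numbers are the ones quoted in the tweet:

```python
# Back-of-the-envelope "RAM" of GPT-4's context window, per the analogy above:
# ~log2(vocab size) bits per token, times context length, divided by 8 bits per byte.
import math

vocab_size = 50_000      # approximate tokenizer vocabulary
context_length = 32_000  # the 32K-context GPT-4 variant

ram_bytes = math.log2(vocab_size) * context_length / 8
print(f"~{ram_bytes / 1024:.0f} KiB")  # ~61 KiB, i.e. roughly a Commodore 64's 64 kB
```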