Further Reading — LLaMA: Open and Efficient Foundation Language Models

The Original Paper

LLaMA: Open and Efficient Foundation Language Models (Meta AI, 2023)
Authors: Touvron, Lavril, Izacard, Martinet, Lachaux, Lacroix, Rozière, Goyal, Hambro, Azhar, Rodriguez, Joulin, Grave, Lample
arXiv: https://arxiv.org/abs/2302.13971
Venue: arXiv preprint, February 2023 (widely cited)

The foundational paper describing the LLaMA models, architectural choices, training procedure, and evaluation on standard benchmarks.

Paper 13: Training Compute-Optimal Large Language Models (Hoffmann et al., 2022)
Link: Chinchilla Scaling Laws
Why read it: Established the Chinchilla compute-optimal scaling laws that LLaMA builds on. LLaMA deliberately trains its smaller models on more tokens than the compute-optimal amount, trading extra training compute for cheaper inference.
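
As a rough illustration of this trade-off, the sketch below uses two widely quoted approximations that are assumptions here, not values taken from this reading list: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a rule of thumb of about 20 training tokens per parameter often attributed to the Chinchilla results. The function names are hypothetical.

```python
# Rough compute arithmetic; both constants are rules of thumb, not exact values.
TOKENS_PER_PARAM = 20       # assumed Chinchilla-style heuristic
FLOPS_PER_PARAM_TOKEN = 6   # assumed approximation: C ~ 6 * N * D


def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token count for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


if __name__ == "__main__":
    n_params = 7e9          # a 7B-parameter model
    actual_tokens = 1.0e12  # LLaMA-7B was trained on about 1.0T tokens
    print(f"heuristic optimum: {compute_optimal_tokens(n_params):.2e} tokens")
    print(f"tokens per parameter actually used: {actual_tokens / n_params:.0f}")
    print(f"approx. training FLOPs: {training_flops(n_params, actual_tokens):.2e}")
```

Under these assumptions a 7B model sees several times more tokens per parameter than the heuristic suggests, which is the "smaller model, more data" choice the entry describes.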

Paper 12: Language Models are Few-Shot Learners (Brown et al., 2020)
Link: GPT-3
Why read it: GPT-3 (175B) is the main baseline that LLaMA is measured against; the LLaMA paper reports that LLaMA-13B outperforms it on most benchmarks. Understanding GPT-3’s architecture and performance helps contextualize LLaMA’s contributions.

Paper 08: Attention Is All You Need (Vaswani et al., 2017)
Link: Transformer Architecture
Why read it: The transformer is the foundation for LLaMA and nearly all modern language models. Essential background.

Paper 15: Training Language Models to Follow Instructions from Human Feedback (Ouyang et al., 2022)
Link: InstructGPT / RLHF
Why read it: LLaMA-2-Chat uses RLHF. Understanding how RLHF works is important for the instruction-tuned variants.

LLaMA 2 and Beyond

LLaMA 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al., 2023)
arXiv: https://arxiv.org/abs/2307.09288
Why read it: Successor to LLaMA-1. Models from 7B to 70B parameters, context length doubled to 4096 tokens, a license that permits commercial use, and RLHF-tuned chat variants.

Code Llama: Open Foundation Models for Code (Rozière et al., 2023)
arXiv: https://arxiv.org/abs/2308.12950
Why read it: Specializes LLaMA-2 for code through further training on code-heavy data. Shows how LLaMA enables domain-specific adaptation.

The Llama 3 Herd of Models (Meta, 2024)
arXiv: https://arxiv.org/abs/2407.21783
https://www.meta.com/research/llama/
Why read it: Successor to LLaMA-2. Up to 405B parameters, improved instruction-following, better multilingual support.

Key Architectural Papers

RMSNorm: Root Mean Square Layer Normalization (Zhang and Sennrich, 2019)
arXiv: https://arxiv.org/abs/1910.07467
Why read it: Introduces RMSNorm, which drops LayerNorm’s mean-centering and rescales activations by their root mean square only. A small efficiency gain, but widely adopted in LLaMA and subsequent models.
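
A minimal sketch of the operation in plain Python, assuming a single hidden vector and a learned per-element gain. The function name and the `eps` default are illustrative choices, not taken from any particular codebase.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then apply a per-element gain.

    Unlike LayerNorm, there is no mean subtraction and no bias term.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]


if __name__ == "__main__":
    hidden = [1.0, -2.0, 3.0, -4.0]
    out = rms_norm(hidden, gain=[1.0] * len(hidden))
    # With unit gain, the root mean square of the output is close to 1.
    print(out, math.sqrt(sum(v * v for v in out) / len(out)))
```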

GLU Variants Improve Transformer (Shazeer, 2020)
arXiv: https://arxiv.org/abs/2002.05202
Why read it: Compares gated feed-forward variants, including SwiGLU, which LLaMA uses. Provides empirical evidence that gating improves performance, though the paper itself offers no explanation of why.
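
For orientation, a SwiGLU feed-forward block can be sketched as follows. This is a plain-Python illustration with hypothetical names (`w_gate`, `w_up`, `w_down`) and no bias terms; real implementations use tensor libraries and batched inputs.

```python
import math


def silu(z):
    """Swish with beta = 1 (also called SiLU): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(matrix, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Compute w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(z) for z in matvec(w_gate, x)]   # gating branch
    up = matvec(w_up, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise gating
    return matvec(w_down, hidden)


if __name__ == "__main__":
    x = [1.0, 2.0]
    w_gate = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]   # 3 x 2
    w_up = [[0.7, -0.8], [0.9, 1.0], [-1.1, 1.2]]    # 3 x 2
    w_down = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]     # 2 x 3
    print(swiglu_ffn(x, w_gate, w_up, w_down))
```

The gated block has three weight matrices instead of the usual two, so implementations typically shrink the hidden width to keep the parameter count comparable.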

RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)
arXiv: https://arxiv.org/abs/2104.09864
Why read it: Introduces RoPE (rotary position embeddings) in detail. RoPE encodes position by rotating query and key vectors, so attention scores depend on the relative offset between tokens.
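
The core idea of rotary embeddings can be sketched in a few lines. The version below is an illustrative plain-Python sketch that rotates adjacent pairs of dimensions; implementations differ in how they pair dimensions, and the `base` of 10000 is the commonly used default, assumed here.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by angles proportional to position."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * cos_t - x1 * sin_t, x0 * sin_t + x1 * cos_t]
    return out


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.0]
    # Both pairs of positions are 2 apart, so the two scores agree up to rounding.
    print(dot(rope(q, 5), rope(k, 3)))
    print(dot(rope(q, 12), rope(k, 10)))
```

Because the query and key are rotated by position-dependent angles, their dot product depends only on the difference between the two positions.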

Fine-Tuning and Adaptation

LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
arXiv: https://arxiv.org/abs/2106.09685
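Why read it: Introduced LoRA, which freezes the pre-trained weights and trains small low-rank update matrices instead. This made LLaMA fine-tuning far cheaper, since typically well under 1% of the parameters are trainable. Essential for understanding why LLaMA derivatives flourished.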
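
To see where the small trainable fraction comes from, the arithmetic for a single weight matrix can be sketched as below. The 4096 x 4096 shape and rank 8 are hypothetical example values, and the function name is illustrative.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of parameters trained when a d_out x d_in weight W is frozen and
    only the low-rank factors B (d_out x rank) and A (rank x d_in) are learned."""
    full = d_out * d_in
    low_rank = rank * (d_out + d_in)
    return low_rank / full


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection adapted with rank 8.
    frac = lora_trainable_fraction(4096, 4096, rank=8)
    print(f"{frac:.4%} of this matrix's parameters are trainable")
```

The fraction relative to the whole model is usually smaller still, because adapters are often attached to only some of the weight matrices.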

Alpaca: A Strong, Replicable Instruction-Following Model (Taori et al., Stanford, 2023)
Blog: https://crfm.stanford.edu/2023/03/13/alpaca.html
Why read it: One of the first widely noticed LLaMA fine-tunes. Showed that cheap instruction fine-tuning (52K synthetic examples; under $600 in total by the authors’ estimate, of which under $100 was fine-tuning compute) could produce a useful assistant.

Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality (Chiang et al., UC Berkeley, 2023)
Blog: https://vicuna.lmsys.org/
Why read it: Improved on Alpaca by fine-tuning LLaMA on roughly 70K user-shared conversations from ShareGPT. Illustrates the importance of data quality for instruction fine-tuning.

QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)
arXiv: https://arxiv.org/abs/2305.14314
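Why read it: Combines 4-bit quantization of the frozen base model with LoRA adapters. The paper reports fine-tuning a 65B model on a single 48 GB GPU, which puts 7B-13B models within reach of consumer GPUs.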
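
A back-of-the-envelope estimate shows why quantizing the base model matters. The sketch below counts weight storage only; it ignores quantization constants, adapter weights, optimizer state, and activations, so real memory use is higher. The function name is hypothetical.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    for n_params in (7e9, 13e9, 65e9):
        fp16 = weight_memory_gib(n_params, 16)
        four_bit = weight_memory_gib(n_params, 4)
        print(f"{n_params / 1e9:.0f}B: 16-bit {fp16:.1f} GiB, 4-bit {four_bit:.1f} GiB")
```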

Datasets and Benchmarks

MATH: Measuring Mathematical Problem Solving With the MATH Dataset (Hendrycks et al., 2021)
https://github.com/hendrycks/math
Why read it: LLaMA is evaluated on this benchmark. Understanding the problems gives insight into model capabilities.

HELM: Holistic Evaluation of Language Models (Liang et al., 2022)
https://crfm.stanford.edu/helm/
Why read it: Comprehensive benchmark suite for comparing language models, LLaMA among them, across many scenarios and metrics.

AlpacaEval: An Automatic Evaluator of Instruction-Following Models (Li et al., 2023)
https://github.com/tatsu-lab/alpaca_eval
Why read it: Evaluation framework for instruction-tuned models using GPT-4 as judge. Widely used to compare LLaMA fine-tunes.

Scaling and Efficiency

Scaling Vision Transformers to 22 Billion Parameters (Dehghani et al., 2023)
arXiv: https://arxiv.org/abs/2302.05442
Why read it: Examines how vision transformers behave as they scale, providing complementary insights to language model scaling.

Efficient Transformers: A Survey (Tay et al., 2020)
arXiv: https://arxiv.org/abs/2009.06732
Why read it: Survey of efficiency improvements in transformers, mainly cheaper attention mechanisms. Useful background for the broader efficiency trends behind choices such as RMSNorm and RoPE.

The Leak and Its Impact

What Happened When LLaMA Leaked (Opinion piece, multiple sources, 2023)
Various AI blogs and Twitter discussions
Why read it: Contextualizes the unintended release of the weights and its impact on the field, including the rapid wave of open-source derivatives that followed.

Competing Open Models

Mistral 7B (Jiang et al., 2023)
arXiv: https://arxiv.org/abs/2310.06825
Why read it: A dense 7B model (not a mixture of experts) that builds on the LLaMA architecture and adds grouped-query and sliding-window attention for further efficiency gains.

Phi-1: Textbooks Are All You Need (Gunasekar et al., Microsoft, 2023)
arXiv: https://arxiv.org/abs/2306.11644
Why read it: The first of Microsoft’s small Phi models. Shows that carefully curated training data lets small models perform well, echoing LLaMA’s emphasis on data over parameter count.

Qwen: Open-Source Language Models (Alibaba, 2023)
https://github.com/QwenLM/Qwen
Why read it: Large-scale open model following LLaMA-style training. Demonstrates global adoption of the approach.

Safety and Alignment

Responsible Release of LLaMA (Meta Blog, 2023)
https://www.meta.com/research/llama/
Why read it: Meta’s perspective on responsible open release, safety considerations, licensing.

On the Opportunities and Risks of Foundation Models (Bommasani et al., Stanford CRFM, 2021)
arXiv: https://arxiv.org/abs/2108.07258
Why read it: Comprehensive analysis of the opportunities and risks of foundation models, LLMs included. Relevant to understanding the implications of the open LLaMA release.

Constitutional AI: Harmlessness from AI Feedback (Bai et al., Anthropic, 2022)
arXiv: https://arxiv.org/abs/2212.08073
Why read it: Anthropic’s approach to training for harmlessness using AI-generated feedback in place of most human preference labels. Complementary to LLaMA-2-Chat’s RLHF approach.

Implementation and Engineering

Transformer Circuits Thread (Chris Olah et al., Anthropic)
https://transformer-circuits.pub/
Why read it: Deep dives into how transformers work mechanically. Useful background for reasoning about what the individual components of a LLaMA-style model do.

Hugging Face Transformers Library (Hugging Face)
https://huggingface.co/transformers/
Why read it: The standard library for loading and using LLaMA. Essential for practical work with the model.
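
A minimal loading-and-generation sketch with this library might look like the following. It assumes `transformers` and PyTorch are installed, and that you have been granted access to the checkpoint named in `model_id`; the default shown is one example of a gated repository, and the function name is hypothetical.

```python
def generate_text(prompt: str, model_id: str = "meta-llama/Llama-2-7b-hf",
                  max_new_tokens: int = 32) -> str:
    """Load a causal language model checkpoint and return a completion for prompt."""
    # Imported lazily so this module can be imported without the library installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Tokenize to PyTorch tensors, generate, and decode the first returned sequence.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("The capital of France is")` downloads the weights on first use, so expect a large download and substantial memory use for a 7B model.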

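vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023)
arXiv: https://arxiv.org/abs/2309.06180
Why read it: Introduces PagedAttention, the KV-cache memory management scheme behind the vLLM serving engine. Important for deploying LLMs like LLaMA efficiently.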

What to Read Next

Paper 18: Mistral 7B (Future/Next)
Why next: Builds directly on LLaMA’s foundation. Shows iterative improvement over the LLaMA architecture.

Paper 16: Let’s Verify Step by Step (Process Reward Models)
Why related: Process supervision, the technique studied here, is relevant to improving mathematical reasoning in LLaMA-style models.

Paper 23: Scaling Test-Time Compute (Future/Related)
Why related: Studies how to spend extra inference compute, for example via multiple generations and ranking, an approach that can be applied to LLaMA-style models.
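
The "multiple generations and ranking" idea is often called best-of-N sampling. A minimal sketch, assuming a hypothetical `generate` callable that samples one completion and a hypothetical `score` callable that rates it (for example, a reward model):

```python
from typing import Callable, List


def best_of_n(prompt: str, generate: Callable[[str], str],
              score: Callable[[str, str], float], n: int = 4) -> str:
    """Sample n candidates and return the one the scorer ranks highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


if __name__ == "__main__":
    # Toy stand-ins: a "model" that returns canned answers, scored by length.
    canned = iter(["4", "2 + 2 = 4", "four"])
    answer = best_of_n("What is 2 + 2?", lambda p: next(canned),
                       lambda p, c: float(len(c)), n=3)
    print(answer)
```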

Discussion Questions for Study Groups

Efficiency trade-offs: LLaMA’s models are smaller than GPT-3 but trained on roughly 3-5x more data (1.0T-1.4T tokens versus about 300B). How does this affect fine-tuning? Does the extra pre-training data mean LLaMA fine-tunes better or worse than GPT-3?
Generalization across languages: LLaMA was trained on English-heavy data but still works somewhat for other languages. Why? How would you improve multilingual performance?
Safety in open release: LLaMA’s weights are public. Does this help or hurt AI safety? Compare to proprietary models like GPT-4.
Architectural choices: RMSNorm and RoPE are small improvements individually. Do they compound? How would you measure their individual contributions?
Fine-tuning versus pre-training: Projects like Alpaca fine-tune LLaMA on 52K examples. How much does pre-training on 1.0T-1.4T tokens matter vs. fine-tuning data?
Compute allocation: Given fixed compute, is LLaMA’s approach (smaller model, more data) always better? Are there tasks where a larger model on less data would be preferable?

Tutorials and Practical Guides

How to Fine-Tune LLaMA (Multiple sources: Hugging Face, Towards Data Science)
Why read it: Practical guides for fine-tuning using LoRA, PEFT, etc.

LLaMA Inference Optimization (Multiple sources)
Why read it: Tips for running LLaMA efficiently on various hardware (consumer GPU, CPU, mobile).

Building Applications with LLaMA (LlamaIndex, LangChain docs)
Why read it: Libraries for building production systems with LLaMA (RAG, agents, etc.).