This guide covers the complete lifecycle of Llama models - from discovery and download to cleanup and removal.
Unlike MLC, which ships a prebuilt set of models, the Llama provider can run any GGUF model from HuggingFace. You can browse available models at HuggingFace GGUF Models.
Here are some popular models that work well on mobile devices:
| Model ID | Size | Best For |
|---|---|---|
| ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf | ~1.8GB | Balanced performance and quality |
| Qwen/Qwen2.5-3B-Instruct-GGUF/qwen2.5-3b-instruct-q3_k_m.gguf | ~1.9GB | General conversations |
| lmstudio-community/gemma-2-2b-it-GGUF/gemma-2-2b-it-Q3_K_M.gguf | ~2.3GB | High quality responses |
Note: When selecting models, consider the quantization level (Q3, Q4, Q5, etc.). Lower-bit quantization means a smaller file but potentially lower output quality. Q4_K_M is a good balance for mobile.
Models are downloaded from HuggingFace using the storage API. The downloadModel function returns the path to the downloaded file:
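A minimal sketch (the import path `react-native-ai/llama` is an assumption, not confirmed by this guide; adjust it to however the provider is exported in your project):

```typescript
// Import path is illustrative; adjust to your installation.
import { downloadModel } from 'react-native-ai/llama'

// Model IDs use the format owner/repo/filename.gguf
const modelPath = await downloadModel(
  'ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf'
)

console.log(`Model saved to: ${modelPath}`)
```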
You can track download progress:
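The second argument is a progress callback that receives `{ percentage: number }`. A sketch, with the import path again assumed:

```typescript
import { downloadModel } from 'react-native-ai/llama'

const modelPath = await downloadModel(
  'ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf',
  ({ percentage }) => {
    // Invoked repeatedly while the download runs
    console.log(`Download progress: ${percentage}%`)
  }
)
```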
Check if a model is already downloaded:
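For example, combining `isModelDownloaded()` with `getModelPath()` to download only when needed (import path is illustrative):

```typescript
import {
  downloadModel,
  getModelPath,
  isModelDownloaded,
} from 'react-native-ai/llama'

const modelId = 'ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf'

// getModelPath() resolves the local path without downloading anything
const localPath = getModelPath(modelId)

if (await isModelDownloaded(modelId)) {
  console.log(`Already available at ${localPath}`)
} else {
  await downloadModel(modelId)
}
```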
Create model instances using the provider methods. Pass the model path (from downloadModel() or getModelPath()):
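A basic sketch, assuming the provider instance is importable as `llama` (import path hypothetical):

```typescript
import { llama, downloadModel } from 'react-native-ai/llama'

const modelPath = await downloadModel(
  'ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf'
)

// Create a language model backed by the downloaded GGUF file
const model = llama.languageModel(modelPath)
```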
With configuration options:
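A sketch using the options documented in the API reference below (defaults shown in comments; import path assumed):

```typescript
import { llama, getModelPath } from 'react-native-ai/llama'

const modelPath = getModelPath('ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf')

const model = llama.languageModel(modelPath, {
  contextParams: {
    n_ctx: 4096,      // larger context window (default: 2048)
    n_gpu_layers: 99, // offload layers to GPU (default: 99)
  },
})

// Embedding models accept similar options
const embedder = llama.textEmbeddingModel(modelPath, {
  contextParams: {
    n_parallel: 8, // parallel embeddings (default: 8)
  },
})
```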
After creating a model instance, prepare it for inference (loads it into memory):
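For example (import path illustrative):

```typescript
import { llama, getModelPath } from 'react-native-ai/llama'

const model = llama.languageModel(
  getModelPath('ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf')
)

// Load the model weights into memory ahead of the first request
await model.prepare()
```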
Calling prepare() ahead of time is recommended for optimal performance. If not called, the model will auto-prepare when first used, but a warning will be logged.
Once prepared, use the model with AI SDK functions:
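For instance, with the AI SDK's `generateText` (the provider import path is an assumption):

```typescript
import { generateText } from 'ai'
import { llama, downloadModel } from 'react-native-ai/llama'

const modelPath = await downloadModel(
  'ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf'
)
const model = llama.languageModel(modelPath)
await model.prepare()

const { text } = await generateText({
  model,
  prompt: 'Explain quantization in one sentence.',
})

console.log(text)
```

`streamText` works the same way when you want tokens as they are produced.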
For advanced usage, you can access the underlying LlamaContext:
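A sketch (import path assumed; the `tokenize()` call is part of llama.rn's LlamaContext API):

```typescript
import { llama, getModelPath } from 'react-native-ai/llama'

const model = llama.languageModel(
  getModelPath('ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf')
)
await model.prepare()

// getContext() exposes the underlying llama.rn LlamaContext,
// so you can call llama.rn APIs directly
const context = model.getContext()
const result = await context.tokenize('Hello, world!')
```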
Unload the model from memory to free resources:
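For example (import path illustrative):

```typescript
import { llama, getModelPath } from 'react-native-ai/llama'

const model = llama.languageModel(
  getModelPath('ggml-org/SmolLM3-3B-GGUF/SmolLM3-Q4_K_M.gguf')
)
await model.prepare()

// ... use the model ...

// Release the model's memory when it is no longer needed
await model.unload()
```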
`llama`

Default provider instance with the following methods:

`llama.languageModel(modelPath, options?)`

Creates a language model instance.

- `modelPath`: Path to the model file (from `downloadModel()` or `getModelPath()`)
- `options`:
  - `projectorPath`: Path to a multimodal projector for vision/audio support
  - `projectorUseGpu`: Use GPU for multimodal processing (default: `true`)
  - `contextParams`: llama.rn context parameters
    - `n_ctx`: Context size (default: 2048, or 4096 for multimodal)
    - `n_gpu_layers`: Number of GPU layers (default: 99)

`llama.textEmbeddingModel(modelPath, options?)`

Creates an embedding model instance.

- `modelPath`: Path to the model file (from `downloadModel()` or `getModelPath()`)
- `options`:
  - `normalize`: Normalize embeddings (default: -1)
  - `contextParams`: llama.rn context parameters
    - `n_ctx`: Context size (default: 2048)
    - `n_gpu_layers`: Number of GPU layers (default: 99)
    - `n_parallel`: Parallel embeddings (default: 8)

`llama.speechModel(modelPath, options)`

Creates a speech model instance for text-to-speech.

- `modelPath`: Path to the model file (from `downloadModel()` or `getModelPath()`)
- `options`:
  - `vocoderPath`: Required. Path to the vocoder model file
  - `vocoderBatchSize`: Batch size for vocoder processing
  - `contextParams`: llama.rn context parameters

These functions are exported directly for model management. Models are stored in `${DocumentDir}/llama-models/`.
`downloadModel(modelId, progressCallback?)`

Download a model from HuggingFace.

- `modelId`: Model identifier in the format `owner/repo/filename.gguf`
- `progressCallback`: Optional callback receiving `{ percentage: number }`
- Returns: `Promise<string>`, the path to the downloaded model file

`getModelPath(modelId)`

Get the local file path for a model (without downloading).

- `modelId`: Model identifier in the format `owner/repo/filename.gguf`
- Returns: `string`, the path where the model file is/would be stored

`isModelDownloaded(modelId)`

Check if a model is downloaded.

- `modelId`: Model identifier in the format `owner/repo/filename.gguf`
- Returns: `Promise<boolean>`

All model types share these common methods:
- `prepare()`: Initialize/load the model into memory
- `getContext()`: Get the underlying LlamaContext (for advanced usage)
- `unload()`: Release the model from memory