LTX-2 Pro

open-source · self-hosted · Best Value

Lightricks · Diffusion Transformer · v2.0 · Verified

From $0.06/sec on FAL.ai

Resolution: 4K
Duration: 6–10s
Providers: 1

Text-to-Video · Image-to-Video · Audio · Lipsync

API Pricing

FAL.ai · Pro · Cheapest
Mode            Features  Price
Text-to-Video   Audio     $0.060/s
Text-to-Video   Audio     $0.120/s
Text-to-Video   Audio     $0.240/s
Image-to-Video  Audio     $0.060/s
Image-to-Video  Audio     $0.120/s
Image-to-Video  Audio     $0.240/s
Verified 2026-04-10
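
The per-second rates above convert to a per-clip cost by multiplying by the clip length. The following is a minimal Python sketch of that arithmetic. The keys `tier_1` to `tier_3` are placeholder labels, not names from FAL.ai, because the listing shows three price rows without saying what distinguishes them (the Strengths list ties the $0.06/sec rate to 1080p). The listed rates are the same for text-to-video and image-to-video.

```python
from decimal import Decimal

# Per-second Pro rates as listed for FAL.ai. The keys are placeholder
# labels (an assumption): the listing does not name the three price rows.
PRO_RATES_USD_PER_SEC = {
    "tier_1": Decimal("0.060"),
    "tier_2": Decimal("0.120"),
    "tier_3": Decimal("0.240"),
}


def clip_cost(duration_sec: int, tier: str = "tier_1") -> Decimal:
    """Return the cost in USD of one clip at the given listed rate."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    return PRO_RATES_USD_PER_SEC[tier] * duration_sec


# Cost of a 10-second clip at each listed rate.
for tier in PRO_RATES_USD_PER_SEC:
    print(tier, clip_cost(10, tier))
```

`Decimal` is used instead of `float` so that currency amounts do not pick up binary rounding error.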

Why LTX-2 Pro?

Strengths

  • Fully open source (Apache 2.0) — complete weights, training code, and LoRA trainer on GitHub for self-deployment
  • Lowest cost with native audio — $0.06/sec at 1080p on FAL.ai includes synchronized audio generation
  • Native 4K output at 50 fps — highest resolution and frame rate among open-source video models
  • Joint audio-video generation in a single inference pass — accurate lip sync and high audio fidelity
  • LoRA fine-tuning support with training completing in under an hour on capable hardware

Limitations

  • Limited to 16:9 aspect ratio — no portrait or square output natively supported
  • Maximum 10 seconds per generation — requires extension for longer sequences
  • No camera control panel, motion brush, or multi-shot generation features
  • Currently available only on FAL.ai for managed API access — fewer provider options than competitors
  • Text and logo rendering within video is not reliably supported
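
The duration and aspect-ratio limits above can be checked on the client before a generation is submitted. The sketch below is one possible way to do that. The `ClipRequest` class and its field names are illustrative and are not FAL.ai's actual request schema. The 6-second minimum comes from the 6–10s duration shown in the listing.

```python
from dataclasses import dataclass

# Limits taken from the listing. The names below are illustrative
# (an assumption), not part of any official SDK or request schema.
MIN_DURATION_SEC = 6
MAX_DURATION_SEC = 10
SUPPORTED_ASPECT_RATIOS = {"16:9"}


@dataclass
class ClipRequest:
    prompt: str
    duration_sec: int
    aspect_ratio: str = "16:9"


def validate(request: ClipRequest) -> list[str]:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems = []
    if not request.prompt.strip():
        problems.append("prompt is empty")
    if not MIN_DURATION_SEC <= request.duration_sec <= MAX_DURATION_SEC:
        problems.append(
            f"duration must be {MIN_DURATION_SEC}-{MAX_DURATION_SEC}s, "
            f"got {request.duration_sec}s"
        )
    if request.aspect_ratio not in SUPPORTED_ASPECT_RATIOS:
        problems.append(
            f"aspect ratio {request.aspect_ratio} is not natively supported"
        )
    return problems


# A 12-second portrait request violates two of the listed limits.
print(validate(ClipRequest("A koi pond at dawn.", duration_sec=12, aspect_ratio="9:16")))
```

Returning every problem at once, rather than raising on the first, lets a caller fix a request in one pass.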

Prompt Guide

  1. Write prompts as a flowing narrative describing a coherent sequence of events unfolding in time, not a list of visual elements or bullet points.
  2. Include five key components: scene anchor (location, time, atmosphere), subject + action (who/what and a verb), camera + lens (movement, focal length, framing), visual style (color science, grading), and motion/time cues (speed, frame intent).
  3. Start with close-ups and move outward: the model retains facial and material detail better in tight framing, while wide shots may soften likeness.
  4. Use concrete nouns and verbs over vague mood words; LTX-2 weighs specific visual and action terms more heavily than abstract atmosphere descriptions.
  5. Match prompt length to duration: 2-second clips need 2-3 sentences, while 10-second clips benefit from 6-8 sentences of detailed direction.
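
The five components in step 2 can be drafted separately and then joined into the single narrative that step 1 calls for. The helper below is a small sketch of that workflow. `build_prompt` and its parameter names are illustrative and are not part of any LTX-2 tooling. The sample text is adapted from the example prompts on this page.

```python
def build_prompt(scene: str, subject_action: str, camera: str,
                 style: str, motion: str) -> str:
    """Join the five prompt components into one flowing paragraph.

    Empty components are skipped, and each kept component is closed with a
    period if it does not already end in sentence punctuation.
    """
    sentences = []
    for part in (scene, subject_action, camera, style, motion):
        part = part.strip()
        if not part:
            continue
        if part[-1] not in ".!?":
            part += "."
        sentences.append(part)
    return " ".join(sentences)


prompt = build_prompt(
    scene="A narrow European alley at twilight, lit by a warm lamp post",
    subject_action="A street musician sits on a wooden crate and strums an acoustic guitar",
    camera="Camera holds steady, medium close-up, shallow depth of field",
    style="Warm golden color grading with long shadows",
    motion="A gentle breeze slowly rustles nearby café curtains",
)
print(prompt)
```

Keeping the components as separate arguments makes it easy to see when one is missing, while the output stays a single paragraph rather than a list.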

✓ Do this

  • For audio-video sync, use cue words like 'on the downbeat,' 'hit on second snare,' or 'cut point at 4s' to align action with generated audio timing
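  • Compose for 16:9, the only natively supported aspect ratio; distinguish wide establishing shots from close-up portraits through framing terms in the prompt rather than by changing the ratio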
  • Leverage LoRA fine-tuning for consistent characters, styles, or brand-specific aesthetics — training completes in under an hour on capable hardware
  • Use the Fast tier ($0.04/sec at 1080p) for rapid iteration and the Pro tier ($0.06/sec) for final deliverables with full audio
  • Choose 50 fps for smooth slow-motion or high-fidelity motion, and 25 fps for standard cinematic output
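
The Fast-for-drafts, Pro-for-finals tip above can be put in numbers. The sketch below compares a mixed session against running every attempt on Pro, using the two 1080p rates quoted in that tip. The function name and the session sizes are illustrative assumptions.

```python
from decimal import Decimal

FAST_RATE = Decimal("0.04")  # USD per second at 1080p, Fast tier (from the tip above)
PRO_RATE = Decimal("0.06")   # USD per second at 1080p, Pro tier (from the tip above)


def session_cost(drafts: int, finals: int, duration_sec: int) -> Decimal:
    """Cost of `drafts` Fast-tier clips plus `finals` Pro-tier clips of equal length."""
    return (FAST_RATE * drafts + PRO_RATE * finals) * duration_sec


# Five 8-second drafts on Fast plus one final on Pro,
# versus the same six attempts all on Pro.
mixed = session_cost(drafts=5, finals=1, duration_sec=8)
all_pro = session_cost(drafts=0, finals=6, duration_sec=8)
print(mixed, all_pro, all_pro - mixed)
```

For this example the mixed session comes to $2.08 against $2.88 for all-Pro, so the saving grows with the number of draft iterations.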

✗ Avoid this

  • Don't rely on readable text or logos inside the frame; the model cannot render them reliably
  • Don't overload prompts with many simultaneous elements; simpler, directed scenes produce better results
  • Don't request portrait (9:16) or square (1:1) output; the aspect ratio is limited to 16:9
  • Don't expect a camera control panel or motion brush; all direction goes through the text prompt
  • Don't plan a single generation longer than 10 seconds; longer content requires extension workflows

Example Prompts

Music / Audio-Visual

A street musician sits on a wooden crate in a narrow European alley at twilight. He strums an acoustic guitar, fingers sliding along the fretboard. The warm golden light from a lamp post casts long shadows. A gentle breeze rustles nearby café curtains. Camera holds steady, medium close-up, shallow depth of field. Guitar melody fills the alley.

Landscape / Nature

Aerial establishing shot slowly descending over a Japanese garden in autumn. Crimson maple leaves drift across a still koi pond reflecting the overcast sky. Camera tilts down as it descends, revealing stone lanterns along a gravel path. Soft ambient wind and water sounds.

Cinematic / Character

Close-up of a woman's face as she opens her eyes, surprised. Her pupils dilate as warm morning light streams through gauze curtains. Camera rack-focuses from her eyes to the window behind her. A clock ticks softly. Film grain, 35mm aesthetic, warm color grading.

Based on the official prompt guide.

FAQ

How much does LTX-2 Pro cost?

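From $0.06/sec on FAL.ai at 1080p. At that rate a 6-second clip, the shortest listed duration, costs about $0.36, and a 10-second clip costs about $0.60.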

Where can I use LTX-2 Pro?

Via API on FAL.ai.

How do I get good results with LTX-2 Pro?

Write prompts as a flowing narrative describing a coherent sequence of events unfolding in time, not a list of visual elements or bullet points. See the prompt guide above.