SkyReels V4
Tags: audio-sync · inpainting · video-editing
Skywork AI (Kunlun) · Dual-Stream Multimodal Diffusion Transformer (MMDiT) · v4.0 · Verified
From $0.12/sec on SkyReels Official API
Resolution
1080p
Duration
3–15s
Providers
1
Why SkyReels V4?
Strengths
- First unified model for joint video-audio generation, inpainting, and editing in one architecture
- Audio-video synchronization at ~40ms accuracy — drum hits and cuts land on beat markers
- Accepts 5 input modalities (text, images, video, masks, audio references) for precise multi-modal control
- Open-source lineage — V1-V3 weights all released on HuggingFace; V4 expected to follow
- 32 FPS output at 1080p for smoother motion than the standard 24 FPS of most competitors
Limitations
- Not yet available on FAL.ai, WaveSpeed, or Replicate — only accessible via SkyReels' own API
- API access is invite-only with limited documentation as of April 2026
- V4 model weights not publicly released yet despite open-source track record
- 15-second max duration caps narrative potential
- Text legibility inside generated scenes is unreliable — small UI elements and signage render poorly
Prompt Guide
1. Use concrete verbs for motion and audio sync — 'steam curls,' 'neon flickers,' 'door slams.' Verbs give the model rhythmic beats to synchronize audio against.
2. Add sonic anchors to prompts — include audio cues like 'snare on 3,' 'vinyl crackle,' 'crowd cheer on cut' for more rhythmically aligned output.
3. Combine multi-modal inputs — upload a reference image of a person, provide an audio sample of footsteps on gravel, and prompt 'walking through a forest at dawn' for maximum control.
4. Keep inpainting masks small — large moving masks can wobble around frames 20-30. Feathering mask edges helps maintain temporal consistency.
5. Shoot source material slightly wider than needed — V4 uses the extra frame space to simulate camera movement without stretching details.
6. Use early previews for fast iteration — rough previews let you reject bad takes in under a minute instead of waiting for full renders, saving 20-25 minutes per batch of six variations.
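Tip 4 above recommends feathering mask edges before inpainting. A minimal pure-NumPy sketch of what that means (a hypothetical pre-processing helper, not part of any SkyReels SDK): repeatedly averaging each pixel with its neighbours turns a hard 0/1 mask boundary into a smooth 0-to-1 ramp.

```python
import numpy as np

def feather_mask(mask: np.ndarray, radius: int = 4) -> np.ndarray:
    """Soften the edges of a binary inpainting mask (illustrative helper only).
    Each pass averages every pixel with its 4-neighbours, so after a few
    passes the hard boundary becomes a gradual 0..1 ramp."""
    soft = mask.astype(np.float64)
    for _ in range(radius):
        padded = np.pad(soft, 1, mode="edge")  # replicate border values
        soft = (padded[1:-1, 1:-1]             # centre
                + padded[:-2, 1:-1]            # up
                + padded[2:, 1:-1]             # down
                + padded[1:-1, :-2]            # left
                + padded[1:-1, 2:]) / 5.0      # right
    return soft

# A hard-edged 8x8 mask with a 4x4 square of ones in the centre.
hard = np.zeros((8, 8))
hard[2:6, 2:6] = 1.0
soft = feather_mask(hard, radius=3)
```

The feathered mask stays in [0, 1]; pixels just inside the original square drop below 1 and pixels just outside rise above 0, which is the soft transition the tip asks for.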
✓ Do this
- Structure prompts as: Scene description + Camera movement + Subject action + Audio cues + Style/mood
- For video inpainting, provide a binary mask and describe only the change — V4 preserves unmasked regions automatically
- For audio-guided generation, provide a short reference clip (5-10s) to condition the audio branch on a specific sound profile
- Text prompts work as a 'fast sketch' — start broad ('overhead desk shot, notebook pages turning, warm morning light') then refine with specific details like paper grain and shutter feel
- Leverage storyboard integration for multi-shot sequences — provide a visual storyboard alongside text for complex multi-modal conditioning
✗ Avoid this
- Don't plan long-form narrative around a single generation — the 15-second cap applies (competitors like Kling v3 also cap at 15s, though some models offer 30-60s)
- Text rendering inside scenes is weak — avoid prompts requiring legible text on signs, screens, or UI elements
- API access remains invite-heavy with limited documentation as of April 2026
- Large inpainting masks can cause temporal wobble — regeneration may affect unintended frame areas
- Weights not yet publicly released despite open-source commitment — V1-V3 weights are on HuggingFace but V4 is still pending
Example Prompts
“Overhead desk shot, notebook pages turning slowly, warm morning light streaming through blinds. Paper grain visible, soft shutter-click feel. The quiet scratch of a fountain pen writing, distant traffic hum.”
“Wide tracking shot through a neon-lit Tokyo alley at night. Rain falling steadily, puddles reflecting pink and blue neon. A figure in a black jacket walks away from camera. Sound: rain hitting pavement, distant bass from a club, footsteps splashing.”
“Close-up of hands kneading bread dough on a floured wooden surface. Camera slowly pulls back to reveal a rustic kitchen. Audio: dough slapping rhythmically, oven humming, birds outside a window.”
Based on the official prompt guide →
FAQ
How much does SkyReels V4 cost?
From $0.12/sec on SkyReels Official API. A 5-second video ≈ $0.60.
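The per-second pricing above makes cost estimation a one-liner. A sketch using the listed starting rate of $0.12/sec (actual billing on SkyReels' API may vary by tier or resolution):

```python
def estimate_cost(duration_s: float, rate_per_sec: float = 0.12) -> float:
    """Estimate render cost in USD at the listed starting rate of $0.12/sec.
    Illustrative only; real invoices depend on the provider's billing rules."""
    return round(duration_s * rate_per_sec, 2)

print(estimate_cost(5))   # → 0.6  (the listing's own 5-second example)
print(estimate_cost(15))  # → 1.8  (a maximum-duration clip)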
Where can I use SkyReels V4?
Via API on SkyReels Official API.
How do I get good results with SkyReels V4?
Use concrete verbs for motion and audio sync — 'steam curls,' 'neon flickers,' 'door slams.' Verbs give the model rhythmic beats to synchronize audio against. See the prompt guide below.