SkyReels V4

audio-sync · inpainting · video-editing

Skywork AI (Kunlun) · Dual-Stream Multimodal Diffusion Transformer (MMDiT) · v4.0 · Verified

From $0.12/sec on SkyReels Official API

Resolution: 1080p
Duration: 3–15s
Providers: 1

Text-to-Video · Image-to-Video · Audio · Camera · Multi-Shot · V2V

API Pricing

SkyReels Official API · V4 · Cheapest
Text-to-Video + Audio: $0.120/s
Text-to-Video: $0.140/s
Image-to-Video + Audio: $0.120/s
Verified 2026-04-10

Why SkyReels V4?

Strengths

  • First unified model for joint video-audio generation, inpainting, and editing in one architecture
  • Audio-video synchronization at ~40ms accuracy — drum hits and cuts land on beat markers
  • Accepts 5 input modalities (text, images, video, masks, audio references) for precise multi-modal control
  • Open-source lineage — V1-V3 weights all released on HuggingFace; V4 expected to follow
  • 32 fps output at 1080p, smoother motion than the 24 fps standard of most competitors

Limitations

  • Not yet available on FAL.ai, WaveSpeed, or Replicate — only accessible via SkyReels' own API
  • API access is invite-only with limited documentation as of April 2026
  • V4 model weights not publicly released yet despite open-source track record
  • 15-second max duration caps narrative potential
  • Text legibility inside generated scenes is unreliable — small UI elements and signage render poorly

Prompt Guide

  1. Use concrete verbs for motion and audio sync — 'steam curls,' 'neon flickers,' 'door slams.' Verbs give the model rhythmic beats to synchronize audio against.
  2. Add sonic anchors to prompts — include audio cues like 'snare on 3,' 'vinyl crackle,' 'crowd cheer on cut' for more rhythmically aligned output.
  3. Combine multi-modal inputs — upload a reference image of a person, provide an audio sample of footsteps on gravel, and prompt 'walking through a forest at dawn' for maximum control.
  4. Keep inpainting masks small — large moving masks can wobble around frame 20-30. Feathering mask edges helps maintain temporal consistency.
  5. Shoot source material slightly wider than needed — V4 uses extra frame space to simulate camera movement without stretching details.
  6. Use early previews for fast iteration — rough previews let you reject bad takes in under a minute instead of waiting for full renders, saving 20-25 minutes per batch of six variations.
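Tip 3's multi-modal combination can be sketched as a request body. The API is invite-only with limited documentation, so every field name below is an assumption for illustration, not an official schema:

```python
import json

# Illustrative multi-modal request body combining a text prompt, an identity
# reference image, and an audio reference clip. All field names are assumed;
# consult the SkyReels API documentation for the real schema.
payload = {
    "prompt": "walking through a forest at dawn",
    "reference_image": "person.png",            # identity/appearance reference
    "audio_reference": "footsteps_gravel.wav",  # conditions the audio branch
    "duration_seconds": 8,                      # within the 3-15s range
    "resolution": "1080p",
    "fps": 32,
}
print(json.dumps(payload, indent=2))
```

The point is the shape of the request: one text prompt plus per-modality reference assets, each steering a different branch of the model.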

✓ Do this

  • Structure prompts as: Scene description + Camera movement + Subject action + Audio cues + Style/mood
  • For video inpainting, provide a binary mask and describe only the change — V4 preserves unmasked regions automatically
  • For audio-guided generation, provide a short reference clip (5-10s) to condition the audio branch on a specific sound profile
  • Text prompts work as a 'fast sketch' — start broad ('overhead desk shot, notebook pages turning, warm morning light') then refine with specific details like paper grain and shutter feel
  • Leverage storyboard integration for multi-shot sequences — provide a visual storyboard alongside text for complex multi-modal conditioning
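The five-part prompt structure above can be enforced with a small helper. This is a sketch of one way to keep prompts consistently ordered; the field names and joining style are my own, not part of any SkyReels tooling:

```python
# Assemble a prompt as: scene + camera + subject + audio cues + style/mood.
# Purely illustrative; SkyReels accepts free-form text prompts.
def build_prompt(scene, camera, subject, audio, style):
    parts = [scene, camera, subject, f"Sound: {audio}", style]
    # Normalize each part to end with exactly one period, skip empty parts.
    return " ".join(p.strip().rstrip(".") + "." for p in parts if p)

prompt = build_prompt(
    scene="Neon-lit Tokyo alley at night, rain falling steadily",
    camera="wide tracking shot moving forward",
    subject="a figure in a black jacket walks away from camera",
    audio="rain on pavement, distant club bass, splashing footsteps",
    style="cinematic, moody, high contrast",
)
print(prompt)
```

Keeping the audio cues in a fixed slot (here prefixed with "Sound:") makes it easy to iterate on sonic anchors without rewriting the visual description.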

✗ Avoid this

  • 15-second max duration restricts long-form narrative — competitors like Kling v3 also cap at 15s but some models offer 30-60s
  • Text rendering inside scenes is weak — avoid prompts requiring legible text on signs, screens, or UI elements
  • API access remains invite-heavy with limited documentation as of April 2026
  • Large inpainting masks can cause temporal wobble — regeneration may affect unintended frame areas
  • Weights not yet publicly released despite open-source commitment — V1-V3 weights are on HuggingFace but V4 is still pending
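The mask-feathering advice above can be made concrete with a toy example: a pure-Python box blur that softens a binary mask's hard 0/1 edge into a gradient. A real pipeline would use an image library's Gaussian blur; this just shows the idea:

```python
# Feather a binary inpainting mask with a simple box blur so the masked
# region blends into its surroundings instead of ending at a hard edge.
def feather(mask, radius=1):
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Average the (2*radius+1)^2 neighborhood, clipped at the borders.
            vals = [mask[cy][cx]
                    for cy in range(max(0, y - radius), min(h, y + radius + 1))
                    for cx in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out

mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
soft = feather(mask)  # interior values stay high, edges fall off gradually
```

The soft ramp at the boundary is what reduces the temporal wobble described above: frames regenerate a blended transition zone rather than a hard seam.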

Example Prompts

Ambient / Lifestyle

Overhead desk shot, notebook pages turning slowly, warm morning light streaming through blinds. Paper grain visible, soft shutter-click feel. The quiet scratch of a fountain pen writing, distant traffic hum.

Cinematic / Urban

Wide tracking shot through a neon-lit Tokyo alley at night. Rain falling steadily, puddles reflecting pink and blue neon. A figure in a black jacket walks away from camera. Sound: rain hitting pavement, distant bass from a club, footsteps splashing.

Product / Food

Close-up of hands kneading bread dough on a floured wooden surface. Camera slowly pulls back to reveal a rustic kitchen. Audio: dough slapping rhythmically, oven humming, birds outside a window.

Based on the official prompt guide →

FAQ

How much does SkyReels V4 cost?

From $0.12/sec on SkyReels Official API. A 5-second video ≈ $0.60.
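The per-second pricing makes cost a simple multiplication. A small sketch, using the rates listed above (mode keys are my own labels):

```python
# Estimate clip cost in USD from the listed per-second rates.
RATES = {
    "text-to-video+audio": 0.120,
    "text-to-video": 0.140,
    "image-to-video+audio": 0.120,
}

def estimate_cost(mode: str, seconds: float) -> float:
    """Price of one clip; SkyReels V4 clips run 3-15 seconds."""
    if not 3 <= seconds <= 15:
        raise ValueError("SkyReels V4 clips are 3-15 seconds long")
    return round(RATES[mode] * seconds, 2)

print(estimate_cost("text-to-video+audio", 5))  # 0.6
```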

Where can I use SkyReels V4?

Only via the SkyReels Official API; it is not yet available on FAL.ai, WaveSpeed, or Replicate.

How do I get good results with SkyReels V4?

Use concrete verbs for motion and audio sync — 'steam curls,' 'neon flickers,' 'door slams.' Verbs give the model rhythmic beats to synchronize audio against. See the prompt guide above.