SkyReels V4

audio-sync · inpainting · video-editing

Skywork AI (Kunlun) · Dual-Stream Multimodal Diffusion Transformer (MMDiT) · v4.0 · Verified

From $0.12/sec on SkyReels Official API

Resolution: 1080p
Duration: 3–15s
Providers: 1

Text-to-Video · Image-to-Video · Audio · Camera · Multi-Shot · V2V

API Pricing

SkyReels Official API · V4 · Cheapest
Text-to-Video + Audio: $0.120/s
Text-to-Video: $0.140/s
Image-to-Video + Audio: $0.120/s
Verified 2026-04-10

Why SkyReels V4?

Strengths

  • First unified model for joint video-audio generation, inpainting, and editing in one architecture
  • Audio-video synchronization at ~40ms accuracy — drum hits and cuts land on beat markers
  • Accepts 5 input modalities (text, images, video, masks, audio references) for precise multi-modal control
  • Open-source lineage — V1-V3 weights all released on HuggingFace; V4 expected to follow
  • 32 fps output at 1080p, smoother motion than the 24 fps standard of most competitors

Limitations

  • Not yet available on FAL.ai, WaveSpeed, or Replicate — only accessible via SkyReels' own API
  • API access is invite-only with limited documentation as of April 2026
  • V4 model weights not publicly released yet despite open-source track record
  • 15-second max duration caps narrative potential
  • Text legibility inside generated scenes is unreliable — small UI elements and signage render poorly

Prompt Guide

  1. Use concrete verbs for motion and audio sync — 'steam curls,' 'neon flickers,' 'door slams.' Verbs give the model rhythmic beats to synchronize audio against.
  2. Add sonic anchors to prompts — include audio cues like 'snare on 3,' 'vinyl crackle,' 'crowd cheer on cut' for more rhythmically aligned output.
  3. Combine multi-modal inputs — upload a reference image of a person, provide an audio sample of footsteps on gravel, and prompt 'walking through a forest at dawn' for maximum control.
  4. Keep inpainting masks small — large moving masks can wobble around frame 20-30. Feathering mask edges helps maintain temporal consistency.
  5. Shoot source material slightly wider than needed — V4 uses extra frame space to simulate camera movement without stretching details.
  6. Use early previews for fast iteration — rough previews let you reject bad takes in under a minute instead of waiting for full renders, saving 20-25 minutes per batch of six variations.
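Tip 3's multi-modal combination can be sketched as a request body. The API is invite-only with limited documentation, so every field name below is an assumption for illustration, not an official schema:

```python
import json

# Illustrative multi-modal request body combining a text prompt, an identity
# reference image, and an audio reference clip. All field names are assumed;
# consult the SkyReels API documentation for the real schema.
payload = {
    "prompt": "walking through a forest at dawn",
    "reference_image": "person.png",            # identity/appearance reference
    "audio_reference": "footsteps_gravel.wav",  # conditions the audio branch
    "duration_seconds": 8,                      # within the 3-15s range
    "resolution": "1080p",
    "fps": 32,
}
print(json.dumps(payload, indent=2))
```

The point is the shape of the request: one text prompt plus per-modality reference assets, each steering a different branch of the model.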

✓ Do this

  • Structure prompts as: Scene description + Camera movement + Subject action + Audio cues + Style/mood
  • For video inpainting, provide a binary mask and describe only the change — V4 preserves unmasked regions automatically
  • For audio-guided generation, provide a short reference clip (5-10s) to condition the audio branch on a specific sound profile
  • Text prompts work as a 'fast sketch' — start broad ('overhead desk shot, notebook pages turning, warm morning light') then refine with specific details like paper grain and shutter feel
  • Leverage storyboard integration for multi-shot sequences — provide a visual storyboard alongside text for complex multi-modal conditioning
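The five-part prompt structure above can be enforced with a small helper. This is a sketch of one way to keep prompts consistently ordered; the field names and joining style are my own, not part of any SkyReels tooling:

```python
# Assemble a prompt as: scene + camera + subject + audio cues + style/mood.
# Purely illustrative; SkyReels accepts free-form text prompts.
def build_prompt(scene, camera, subject, audio, style):
    parts = [scene, camera, subject, f"Sound: {audio}", style]
    # Normalize each part to end with exactly one period, skip empty parts.
    return " ".join(p.strip().rstrip(".") + "." for p in parts if p)

prompt = build_prompt(
    scene="Neon-lit Tokyo alley at night, rain falling steadily",
    camera="wide tracking shot moving forward",
    subject="a figure in a black jacket walks away from camera",
    audio="rain on pavement, distant club bass, splashing footsteps",
    style="cinematic, moody, high contrast",
)
print(prompt)
```

Keeping the audio cues in a fixed slot (here prefixed with "Sound:") makes it easy to iterate on sonic anchors without rewriting the visual description.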

✗ Avoid this

  • 15-second max duration restricts long-form narrative — competitors like Kling v3 also cap at 15s but some models offer 30-60s
  • Text rendering inside scenes is weak — avoid prompts requiring legible text on signs, screens, or UI elements
  • API access remains invite-heavy with limited documentation as of April 2026
  • Large inpainting masks can cause temporal wobble — regeneration may affect unintended frame areas
  • Weights not yet publicly released despite open-source commitment — V1-V3 weights are on HuggingFace but V4 is still pending
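The mask-feathering advice above can be made concrete with a toy example: a pure-Python box blur that softens a binary mask's hard 0/1 edge into a gradient. A real pipeline would use an image library's Gaussian blur; this just shows the idea:

```python
# Feather a binary inpainting mask with a simple box blur so the masked
# region blends into its surroundings instead of ending at a hard edge.
def feather(mask, radius=1):
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Average the (2*radius+1)^2 neighborhood, clipped at the borders.
            vals = [mask[cy][cx]
                    for cy in range(max(0, y - radius), min(h, y + radius + 1))
                    for cx in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out

mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
soft = feather(mask)  # interior values stay high, edges fall off gradually
```

The soft ramp at the boundary is what reduces the temporal wobble described above: frames regenerate a blended transition zone rather than a hard seam.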

Example Prompts

Ambient / Lifestyle

Overhead desk shot, notebook pages turning slowly, warm morning light streaming through blinds. Paper grain visible, soft shutter-click feel. The quiet scratch of a fountain pen writing, distant traffic hum.

Cinematic / Urban

Wide tracking shot through a neon-lit Tokyo alley at night. Rain falling steadily, puddles reflecting pink and blue neon. A figure in a black jacket walks away from camera. Sound: rain hitting pavement, distant bass from a club, footsteps splashing.

Product / Food

Close-up of hands kneading bread dough on a floured wooden surface. Camera slowly pulls back to reveal a rustic kitchen. Audio: dough slapping rhythmically, oven humming, birds outside a window.

Based on the official prompt guide →

FAQ

How much does SkyReels V4 cost?

From $0.12/sec on SkyReels Official API. A 5-second video ≈ $0.60.
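The per-second pricing makes cost a simple multiplication. A small sketch, using the rates listed above (mode keys are my own labels):

```python
# Estimate clip cost in USD from the listed per-second rates.
RATES = {
    "text-to-video+audio": 0.120,
    "text-to-video": 0.140,
    "image-to-video+audio": 0.120,
}

def estimate_cost(mode: str, seconds: float) -> float:
    """Price of one clip; SkyReels V4 clips run 3-15 seconds."""
    if not 3 <= seconds <= 15:
        raise ValueError("SkyReels V4 clips are 3-15 seconds long")
    return round(RATES[mode] * seconds, 2)

print(estimate_cost("text-to-video+audio", 5))  # 0.6
```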

Where can I use SkyReels V4?

Only via the SkyReels Official API; it is not yet available on FAL.ai, WaveSpeed, or Replicate.

How do I get good results with SkyReels V4?

Use concrete verbs for motion and audio sync — 'steam curls,' 'neon flickers,' 'door slams.' Verbs give the model rhythmic beats to synchronize audio against. See the prompt guide above.