Aliyun Bailian (4): Wanxiang Video Generation End-to-End
Wanxiang text-to-video and image-to-video for production: the async task pattern, polling with backoff, prompt techniques that survive contact with reality, and the OSS write-through that saves you when result URLs expire.
Wanxiang is the API that has done the most for our marketing pipeline and caused the most production surprises. The model is genuinely good — wan2.5-t2v-plus produces 720p clips that pass for an actual video team’s output most of the time — but the surface around it is async-only and native-protocol, hands back URLs that expire, and rate-limits in non-obvious ways. This article is the version of the docs that has been through six months of “why is this happening at 2am” tickets.
The model lineup
Three models, all native-only (no OpenAI compat), all async:

- `wan2.5-t2v-plus` is the one I use 80% of the time — text-to-video is the most flexible and the easiest to brief without a designer.
- `wan2.5-i2v-plus` is for cases where the marketing team already has a hero image they want to animate (a still product shot becomes a 5-second turntable).
- `wan2.5-kf2v-plus` is for transitions: hand it a first frame and a last frame, get back the in-between motion.
The end-to-end flow
There is one flow, repeated for every video: submit the generation task with the async header, capture the `task_id` from the response, poll the task endpoint until the status is `SUCCEEDED` or `FAILED`, then download the result from the returned URL and archive it somewhere you control.

The minimum viable Python:
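Endpoint paths and response field names below follow the DashScope API as I know it — treat them as assumptions and verify against the current API reference before shipping:

```python
import os
import time

import requests

BASE = "https://dashscope.aliyuncs.com/api/v1"


def _headers():
    return {"Authorization": f"Bearer {os.environ['DASHSCOPE_API_KEY']}"}


def submit_t2v(prompt: str, size: str = "1280*720") -> str:
    """POST the async generation task; returns the task_id to poll."""
    resp = requests.post(
        f"{BASE}/services/aigc/video-generation/video-synthesis",
        # X-DashScope-Async is what makes the call return a task_id
        # immediately instead of blocking for the whole generation.
        headers={**_headers(), "X-DashScope-Async": "enable"},
        json={
            "model": "wan2.5-t2v-plus",
            "input": {"prompt": prompt},
            "parameters": {"size": size},
        },
    )
    resp.raise_for_status()
    return resp.json()["output"]["task_id"]


def wait_for_video(task_id: str, poll_interval: float = 5.0) -> str:
    """Poll the task endpoint until a terminal state; returns the result URL."""
    while True:
        task = requests.get(f"{BASE}/tasks/{task_id}", headers=_headers()).json()
        status = task["output"]["task_status"]
        if status == "SUCCEEDED":
            # Field name per the docs as I remember them; check the
            # task-query response schema for your model.
            return task["output"]["video_url"]
        if status == "FAILED":
            raise RuntimeError(task["output"].get("message", "generation failed"))
        time.sleep(poll_interval)
```

The fixed `poll_interval` here is deliberately naive — the next section replaces it with a backoff schedule.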
Polling with backoff — pick a sensible schedule
Polling every second is wasteful and gets you rate-limited. Polling every 30 seconds wastes user time. The backoff schedule I use:

Start at 5 seconds, multiply by 1.45 each iteration, cap at 60 seconds. A typical 720p 5-second clip finishes in 30-90 seconds, so the median user waits about 4 polls.
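That schedule is a five-line generator; `base`, `factor`, and `cap` here are my numbers, not anything the API mandates:

```python
def backoff_schedule(base: float = 5.0, factor: float = 1.45, cap: float = 60.0):
    """Yield poll delays: start at `base` seconds, grow by `factor`, cap at `cap`."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)


# First few delays: 5.0, 7.25, ~10.5, ~15.2 seconds — so a clip that
# finishes around the 38-second mark has cost four polls.
```

Wire it into the polling loop as `for delay in backoff_schedule(): ... ; time.sleep(delay)`.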
For a backend service, the right pattern is often not to poll inside the request handler. Instead:
- User submits prompt → you POST to Wanxiang and store `task_id` in your DB.
- Return immediately with a job URL.
- A background worker polls and updates the DB when the status is `SUCCEEDED`.
- The frontend polls your DB, not Wanxiang.
That gives you retry, observability, and a place to store the result URL before it expires.
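The split above can be sketched in a few lines; SQLite stands in for your DB here, and the table and column names are my own convention, not anything Wanxiang dictates:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE video_jobs (
        task_id    TEXT PRIMARY KEY,
        status     TEXT NOT NULL DEFAULT 'PENDING',
        result_url TEXT
    )"""
)


def handle_submit(task_id: str) -> str:
    """Request handler: record the Wanxiang task_id, return a job URL immediately."""
    db.execute("INSERT INTO video_jobs (task_id) VALUES (?)", (task_id,))
    db.commit()
    return f"/jobs/{task_id}"


def worker_tick(task_id: str, status: str, result_url=None):
    """Background worker: after each poll of Wanxiang, mirror the task state
    into the DB. The frontend only ever reads this table."""
    db.execute(
        "UPDATE video_jobs SET status = ?, result_url = ? WHERE task_id = ?",
        (status, result_url, task_id),
    )
    db.commit()
```

Because the worker owns the Wanxiang conversation, retries and the archive step (next section) have one obvious home.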
Save the URL immediately — they expire in 24h
The single most expensive mistake I have seen in production: someone fetched the result_url, displayed it on the site, and then the page broke 24 hours later when the URL stopped resolving. The URLs Wanxiang returns are signed and time-bound. Always copy the file to your own OSS bucket on success:
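A sketch of the copy step — the key layout is my own convention, and the bucket is an `oss2.Bucket` constructed once at startup (e.g. `oss2.Bucket(oss2.Auth(key_id, key_secret), "https://oss-cn-hangzhou.aliyuncs.com", "my-video-archive")`, with your own endpoint and bucket name):

```python
import requests


def archive_result(bucket, task_id: str, result_url: str) -> str:
    """Stream the signed result URL into our own OSS bucket before it expires.

    `bucket` is an oss2.Bucket; `result_url` is the signed, time-bound URL
    from the SUCCEEDED task response.
    """
    resp = requests.get(result_url, stream=True, timeout=60)
    resp.raise_for_status()
    key = f"wanxiang/{task_id}.mp4"
    # oss2's put_object accepts file-like objects, so the video never
    # needs to touch local disk.
    bucket.put_object(key, resp.raw)
    return f"oss://{bucket.bucket_name}/{key}"
```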
I do this synchronously inside the polling worker, before returning success. If the archive step fails, the task isn’t done.
Prompt patterns that survive
A surprisingly high fraction of Wanxiang quality is in the prompt. After a few months of iteration, the structure that works:
[shot type], [subject], [action], [setting / environment],
[lighting], [camera movement], [style], [quality keywords]
Examples that have gone to production:
- `wide angle, a cup of bubble tea, condensation drops sliding down the cup, on a marble table next to a window, soft afternoon backlight, slow dolly in, photorealistic, 4k, shallow depth of field`
- `medium shot, a young woman wearing a Hanfu dress, walking through a Hangzhou bamboo forest, early morning mist, dappled light, smooth tracking shot from behind, cinematic film look, 35mm`
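The template is mechanical enough to encode. A tiny helper — my own convention, not an API feature — keeps briefs in the fixed field order:

```python
# Field order from the prompt template above.
PROMPT_FIELDS = [
    "shot", "subject", "action", "setting",
    "lighting", "camera", "style", "quality",
]


def build_prompt(**parts: str) -> str:
    """Assemble a Wanxiang prompt in the fixed field order, skipping empty slots."""
    return ", ".join(parts[f] for f in PROMPT_FIELDS if parts.get(f))


prompt = build_prompt(
    shot="wide angle",
    subject="a cup of bubble tea",
    action="condensation drops sliding down the cup",
    setting="on a marble table next to a window",
    lighting="soft afternoon backlight",
    camera="slow dolly in",
    style="photorealistic",
    quality="4k, shallow depth of field",
)
```

The point is not the function — it is that every brief answers the same eight questions in the same order.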
Things that hurt quality:
- Negative phrasing in the main prompt (“no text on screen”). Use the `negative_prompt` parameter if you need exclusions.
- More than ~3 main subjects. The model conflates them.
- Specific brand or person names. Generic descriptions work better.
- Cyrillic, Arabic, or Devanagari script as text-on-frame. Wanxiang is currently English- and Chinese-text aware; other scripts come out as garbled glyphs.
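Moving negatives out of the main prompt is just a request-body change. The placement of `negative_prompt` inside `input` here is my reading of the docs — verify against the current model card:

```python
payload = {
    "model": "wan2.5-t2v-plus",
    "input": {
        "prompt": "wide angle, a cup of bubble tea, photorealistic, 4k",
        # Exclusions live here, never in the prompt itself.
        "negative_prompt": "text on screen, watermark, logo",
    },
    "parameters": {"size": "1280*720"},
}
```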
Image-to-video and keyframe-to-video
Same flow, different model and inputs. I2V takes an image_url (OSS-signed URL works); KF2V takes first_frame_url and last_frame_url. The duration limits are model-dependent (typically 5 or 10 seconds); read the model card before generating.
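The request bodies differ only in the `input` block. The field names below follow the article’s parameter names, and the URLs are placeholders — check both against the model card:

```python
i2v_payload = {
    "model": "wan2.5-i2v-plus",
    "input": {
        "prompt": "the product slowly rotating on a turntable, studio lighting",
        # In practice this is a signed OSS URL to the hero still.
        "image_url": "https://example.com/hero.jpg",
    },
    # duration / size limits are model-dependent; set them per the model card.
}

kf2v_payload = {
    "model": "wan2.5-kf2v-plus",
    "input": {
        "prompt": "smooth transition between the two frames",
        "first_frame_url": "https://example.com/frame_a.jpg",
        "last_frame_url": "https://example.com/frame_b.jpg",
    },
}
```

Everything else — async header, task polling, URL archiving — is identical to the T2V flow.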
A useful production pattern for product demos:
- Photographer ships a hero still.
- We prompt: “the product slowly rotating on a turntable, studio lighting”.
- I2V produces a 5-second turntable.
- Embed the clip on the product page next to the hero still.
Cost is a few RMB per clip; the alternative is a half-day of someone’s photography time.
What to do when SUCCEEDED but the video looks wrong
The most common failure is “the model generated something, but it ignored half the prompt”. Causes:
- Prompt too long. Wanxiang has a soft limit; aggressive trimming helps.
- Prompt contradictory (“daytime, dark, neon”). Pick one.
- Wrong model variant. T2V will not animate a specific image; you wanted I2V.
- Wrong aspect ratio. The `size` parameter shapes composition; `1280*720` and `720*1280` produce different framings.
Generate three variants per critical prompt with different seeds (the `seed` parameter). One of them is usually the right one.
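Fanning out seeds is just cloning the request body; the specific seed values below are arbitrary:

```python
import copy


def seed_variants(payload: dict, seeds=(1, 42, 20240601)) -> list:
    """Clone the request body once per seed so the variants can be submitted in parallel."""
    variants = []
    for seed in seeds:
        p = copy.deepcopy(payload)  # never mutate the caller's payload
        p.setdefault("parameters", {})["seed"] = seed
        variants.append(p)
    return variants
```

Submit all three, archive all three, and let the marketing team pick.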
Cost and rate limits
Wanxiang bills per second of generated video. A 5-second 720p clip is on the order of a few RMB. Concurrent task limits are per-API-key — for production traffic, request a quota increase via the console before you launch. The default (last I checked, 5 concurrent tasks per workspace) is fine for prototyping and instantly insufficient for any real product.
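Client-side, the cheapest way to stay under the quota is a semaphore held for a task’s whole lifetime — submit, poll, archive, release. The limit of 5 mirrors the default quota mentioned above; raise it when your quota increase lands:

```python
import threading

MAX_CONCURRENT_TASKS = 5  # default per-workspace quota, last I checked

_slots = threading.BoundedSemaphore(MAX_CONCURRENT_TASKS)


def run_generation(submit_and_poll, *args):
    """Hold a concurrency slot for the task's full lifetime, then release it.

    `submit_and_poll` is whatever callable wraps your submit → poll → archive
    sequence; this gate only bounds how many run at once.
    """
    with _slots:
        return submit_and_poll(*args)
```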
What’s next
Article 5 closes the series with Qwen-TTS-Flash — speech synthesis with the only Chinese-dialect voices I’d ship to production. It’s also native-only, so the patterns from this article apply.