🎨 FoxeTales AI Pipeline — Iter 7 (Current State)

Updated: 2026-05-28 | Latest from production code at github.com/Foxetales/ai-infra/workflows/faceswap_fullbody_v1.json

📥 Inputs (per render job)

1. Customer photo — kid face reference (JPG/PNG)
2. Template assets (artist-prepared, per page):

background.png — page background plate (no character)
character.png — character layer with alpha (transparent bg)
person_mask.png — full-body person region mask (NEW iter7, derived from character alpha)

3. Ethnicity metadata — text prompt anchor (e.g. "Vietnamese Asian")
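
As a rough illustration, the three inputs could be bundled into one job record before rendering. This is only a sketch: `RenderJob`, `TemplatePage`, and `template_for` are hypothetical names that do not come from the production repo, and the validation rules are assumptions based on the file types listed above.

```python
from dataclasses import dataclass
from pathlib import Path

# Assumption: JPG/PNG are the only accepted customer photo formats.
ALLOWED_PHOTO_SUFFIXES = {".jpg", ".jpeg", ".png"}


@dataclass(frozen=True)
class TemplatePage:
    background: Path   # page background plate, no character
    character: Path    # character layer with alpha
    person_mask: Path  # full-body mask derived from the character alpha


@dataclass(frozen=True)
class RenderJob:
    customer_photo: Path
    template: TemplatePage
    ethnicity: str  # text prompt anchor, e.g. "Vietnamese Asian"

    def __post_init__(self):
        if self.customer_photo.suffix.lower() not in ALLOWED_PHOTO_SUFFIXES:
            raise ValueError(f"unsupported photo type: {self.customer_photo.suffix}")
        if not self.ethnicity.strip():
            raise ValueError("ethnicity anchor must not be empty")


def template_for(page_dir: Path) -> TemplatePage:
    """Build the per-page asset paths using the file names listed above."""
    return TemplatePage(
        background=page_dir / "background.png",
        character=page_dir / "character.png",
        person_mask=page_dir / "person_mask.png",
    )
```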

⚙️ Processing pipeline

1. PuLID identity extraction

Customer photo → InsightFace AntelopeV2 face detection → EVA-CLIP embedding → PuLID-Flux v0.9.1 identity vector
Output: identity embedding (~512-dim). The loaded PuLID model is 1085MB.

↓

2. Mask preparation

Person mask → GrowMask(+2) → FeatherMask(12px) → smooth person-region mask for inpaint
Iter7 change: switched from a face-only mask to a person-region mask (addresses Chi's feedback)
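
The grow-then-feather idea can be sketched in plain Python on a small grid. This is an approximation for intuition only, not the implementation of the GrowMask or FeatherMask nodes: it assumes a square dilation window and uses a box blur as a stand-in for feathering. The function names are illustrative, and a real 1024×1024 mask would be processed with an array library rather than nested lists.

```python
def _window(mask, y, x, radius):
    """Yield the mask values inside a square window, clipped at the borders."""
    h, w = len(mask), len(mask[0])
    for yy in range(max(0, y - radius), min(h, y + radius + 1)):
        for xx in range(max(0, x - radius), min(w, x + radius + 1)):
            yield mask[yy][xx]


def grow_mask(mask, radius):
    """Dilate the mask: each pixel takes the maximum value in its window."""
    return [
        [max(_window(mask, y, x, radius)) for x in range(len(mask[0]))]
        for y in range(len(mask))
    ]


def feather_mask(mask, radius):
    """Soften edges: each pixel becomes the mean of its window (box blur)."""
    out = []
    for y in range(len(mask)):
        row = []
        for x in range(len(mask[0])):
            values = list(_window(mask, y, x, radius))
            row.append(sum(values) / len(values))
        out.append(row)
    return out


def prepare_mask(person_mask, grow=2, feather=12):
    """Grow first so the soft edge falls outside the character silhouette."""
    return feather_mask(grow_mask(person_mask, grow), feather)
```

Growing before feathering keeps the fully opaque part of the mask over the whole character, so the soft falloff lands on the surrounding background rather than on the body edge.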

↓

3. Conditioning

CLIP text encoders: CLIP-L + T5-xxl-fp16 (9.2GB)
Positive prompt: "fxtl style watercolor, {ethnicity} child, full body, skin tone consistent across face arms and legs, FLAT 2D, Foxetales storybook"
Negative: "photograph, realistic, 3d, semi-realistic, beauty filter"
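
A minimal sketch of how the `{ethnicity}` placeholder might be filled per job. The prompt strings are copied from this section; `build_prompts` is a hypothetical helper name, and rejecting an empty anchor is an assumption about desired behavior.

```python
POSITIVE_TEMPLATE = (
    "fxtl style watercolor, {ethnicity} child, full body, "
    "skin tone consistent across face arms and legs, FLAT 2D, Foxetales storybook"
)
NEGATIVE_PROMPT = "photograph, realistic, 3d, semi-realistic, beauty filter"


def build_prompts(ethnicity: str) -> tuple[str, str]:
    """Return (positive, negative) prompts for one render job."""
    ethnicity = ethnicity.strip()
    if not ethnicity:
        # Assumption: an empty anchor is treated as a job error, not a default.
        raise ValueError("ethnicity anchor must not be empty")
    return POSITIVE_TEMPLATE.format(ethnicity=ethnicity), NEGATIVE_PROMPT
```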

↓

4. FLUX KSampler

FLUX.1-dev (23GB) + FoxeTales LoRA v4 (strength 0.7) + PuLID identity injection
Iter7 locked config: sampler euler, scheduler beta, steps 25, denoise 0.55 (lowered from the face-only 0.85 to preserve pose)
SetLatentNoiseMask = person-region mask
Output: latent for a 1024×1024 image, regenerated in the person region only
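
Because the sampler settings are locked, a small guard can flag drift before a workflow is deployed. The sketch below assumes the sampler node's inputs are available as a dict with the keys `sampler_name`, `scheduler`, `steps`, and `denoise`. Both those key names and the helper `drift_from_locked` are illustrative assumptions, not taken from the production workflow file.

```python
# Locked iter7 values from this section; key names are assumed.
LOCKED_ITER7 = {
    "sampler_name": "euler",
    "scheduler": "beta",
    "steps": 25,
    "denoise": 0.55,
}


def drift_from_locked(node_inputs: dict) -> dict:
    """Return {key: (expected, actual)} for each locked key that differs.

    Keys that are not locked (for example a seed) are ignored.
    """
    return {
        key: (expected, node_inputs.get(key))
        for key, expected in LOCKED_ITER7.items()
        if node_inputs.get(key) != expected
    }
```

An empty result means the node matches the locked config. A result such as `{"denoise": (0.55, 0.85)}` would indicate that the face-only setting leaked back in.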

↓

5. VAE decode

FLUX VAE (ae.safetensors, 320MB) → decode latent to RGB image (1024×1024)

↓

6. Composite

ImageCompositeMasked with feathered person mask → blend regenerated character into original background plate
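
The blend itself is a per-pixel linear mix weighted by the feathered mask. The sketch below shows that arithmetic on nested lists of RGB tuples. It is not the ImageCompositeMasked node's code, and the function names are illustrative.

```python
def composite_pixel(background, character, mask_value):
    """Blend one RGB pixel: mask 1.0 keeps the character, 0.0 keeps the plate."""
    m = min(1.0, max(0.0, mask_value))  # clamp the mask weight to [0, 1]
    return tuple(round(c * m + b * (1.0 - m)) for b, c in zip(background, character))


def composite(background, regenerated, mask):
    """Blend two equally sized images (rows of RGB tuples) through a mask."""
    return [
        [composite_pixel(b, r, m) for b, r, m in zip(bg_row, rg_row, m_row)]
        for bg_row, rg_row, m_row in zip(background, regenerated, mask)
    ]
```

With a feathered mask, pixels near the character edge receive intermediate weights, which avoids a hard seam between the regenerated region and the background plate.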

↓

7. Skin shift postproc (dark-skin presets only)

pipeline/skin_shift.py — YCbCr skin detection × person_mask → multiply skin pixels by RGB scale
Presets (R/G/B multipliers): asian/blonde (no-op), indian (0.85/0.75/0.65), black (0.65/0.50/0.42), black_deep (0.55/0.42/0.35)
Why: the style LoRA was trained on light-skinned children, so it suppresses dark skin tones during generation. Applying the shift after generation sidesteps that bias.
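
A per-pixel sketch of the skin-shift idea follows; it is not the contents of pipeline/skin_shift.py. The preset multipliers come from this section. The RGB to YCbCr conversion is the standard full-range BT.601 form. The Cb/Cr skin thresholds (77 to 127 and 133 to 173) are a commonly used heuristic and an assumption here, as is blending the scale by the mask weight.

```python
PRESETS = {
    "asian": (1.0, 1.0, 1.0),       # no-op
    "blonde": (1.0, 1.0, 1.0),      # no-op
    "indian": (0.85, 0.75, 0.65),
    "black": (0.65, 0.50, 0.42),
    "black_deep": (0.55, 0.42, 0.35),
}


def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 conversion for 8-bit RGB values."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def is_skin(r, g, b):
    """Assumed chroma thresholds; the production detector may differ."""
    _, cb, cr = rgb_to_ycbcr(r, g, b)
    return 77 <= cb <= 127 and 133 <= cr <= 173


def shift_pixel(rgb, mask_value, preset):
    """Scale a skin pixel inside the person mask; leave other pixels unchanged."""
    if mask_value <= 0 or not is_skin(*rgb):
        return tuple(rgb)
    scale = PRESETS[preset]
    # Weight the multiplier by the mask so feathered edges fade smoothly.
    return tuple(
        round(c * (1 - mask_value + mask_value * s)) for c, s in zip(rgb, scale)
    )
```

For example, under these assumptions `shift_pixel((200, 150, 120), 1.0, "black")` yields `(130, 75, 50)`, while a saturated blue pixel such as `(30, 60, 200)` fails the skin test and passes through unchanged.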

↓

📤 Output: Final personalized page (1024×1024 PNG, ~1MB) ready for book layout / preview

📊 Performance characteristics

GPU time: ~25-32s per page (warm), ~60s first job (cold model load)
VRAM: ~45GB peak (FLUX 22.7GB + T5 9.3GB + PuLID 1.1GB + LoRA + working memory)
Throughput: ~110-130 pages/hour on RTX A6000 48GB
Cost: $0.43/h ÷ ~120 pages/h ≈ $0.0036/page (raw GPU)
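
The figures above can be cross-checked with a few lines of arithmetic. The helper names are illustrative. Warm GPU time alone gives 112.5 to 144 pages/hour, so the quoted ~110-130 presumably leaves headroom for per-job overhead such as I/O and queueing, which is an assumption on our part.

```python
def pages_per_hour(seconds_per_page: float) -> float:
    """Upper bound on throughput if the GPU does nothing but render."""
    return 3600 / seconds_per_page


def cost_per_page(hourly_rate_usd: float, pages_each_hour: float) -> float:
    """Raw GPU cost per page at a given hourly rate and throughput."""
    return hourly_rate_usd / pages_each_hour


slowest = pages_per_hour(32)      # 112.5 pages/hour at 32 s/page
fastest = pages_per_hour(25)      # 144.0 pages/hour at 25 s/page
raw_cost = cost_per_page(0.43, 120)  # about $0.0036 per page

# Listed model footprints in GB; LoRA and working memory make up the rest
# of the ~45 GB peak.
listed_vram_gb = round(22.7 + 9.3 + 1.1, 1)  # 33.1
```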

🔄 Iter7 changes vs previous

Component	Iter6 (face-only)	Iter7 (current)
Mask type	Face bounding box	Person region (alpha-derived)
KSampler denoise	0.85	0.55 (preserve pose)
Grow / Feather	None	+2 / 12px (smooth body edges)
Prompt skin anchor	"{ethnicity}" only	"{ethnicity}" + "skin tone consistent across face arms and legs"
Postproc	None	YCbCr skin shift for dark ethnicities
ControlNet	Planned	⚠ Not yet; deferred to Phase 2 (needed for pose-sensitive pages)

🚧 Known limitations (need Phase 2 work)

Pose preservation on sleeping/side-view pages (p42 risk) — needs ControlNet OpenPose
Hair style flexibility — currently forced to the template hair (Chi accepted this limit: hair is not personalized)
Black skin tone via LoRA — the postproc workaround works, but the ideal fix is to retrain as LoRA v5 on a more diverse dataset
Speed — 25-32s/page acceptable for batch but slow for live preview (target <10s with fp8 quantization)