TY - GEN
T1 - MATIC
T2 - 18th IEEE Pacific Visualization Conference, PacificVis 2025
AU - Wu, Chiao Hsin
AU - Lai, I. Wei
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Advances in diffusion models enable the creation of highly detailed images. However, fusing text and images concurrently poses significant challenges: it is difficult to maintain text accuracy across languages, place text optimally, and choose appropriate typography. To address these challenges, we introduce the Multilingual Accurate Textual Image Customization (MATIC) framework. MATIC employs the Chain-of-Thought (CoT) concept to decompose the textual image generation process into multiple steps, leveraging diverse generative artificial intelligence models, including a Multimodal Large Language Model (MMLLM) and a diffusion model. The framework first generates the desired text and a corresponding prompt for the diffusion model based on user input. The diffusion-generated image is then examined to remove any undesired text. Meanwhile, the typographic elements are designed to align with the visual content. Finally, the textual image is fused with the aid of a grid coordinate system, evaluated by the MMLLM, and further customized by the user through natural language. Experimental results demonstrate that MATIC can produce accurate, high-quality, multilingual textual images that meet user requirements across various domains, including digital marketing, graphic design, and educational content creation.
AB - Advances in diffusion models enable the creation of highly detailed images. However, fusing text and images concurrently poses significant challenges: it is difficult to maintain text accuracy across languages, place text optimally, and choose appropriate typography. To address these challenges, we introduce the Multilingual Accurate Textual Image Customization (MATIC) framework. MATIC employs the Chain-of-Thought (CoT) concept to decompose the textual image generation process into multiple steps, leveraging diverse generative artificial intelligence models, including a Multimodal Large Language Model (MMLLM) and a diffusion model. The framework first generates the desired text and a corresponding prompt for the diffusion model based on user input. The diffusion-generated image is then examined to remove any undesired text. Meanwhile, the typographic elements are designed to align with the visual content. Finally, the textual image is fused with the aid of a grid coordinate system, evaluated by the MMLLM, and further customized by the user through natural language. Experimental results demonstrate that MATIC can produce accurate, high-quality, multilingual textual images that meet user requirements across various domains, including digital marketing, graphic design, and educational content creation.
KW - Artificial intelligence
KW - Computer Vision
KW - Computing methodologies
KW - Natural language processing
UR - https://www.scopus.com/pages/publications/105009210910
UR - https://www.scopus.com/pages/publications/105009210910#tab=citedBy
U2 - 10.1109/PacificVis64226.2025.00042
DO - 10.1109/PacificVis64226.2025.00042
M3 - Conference contribution
AN - SCOPUS:105009210910
T3 - IEEE Pacific Visualization Symposium
SP - 352
EP - 357
BT - Proceedings - 2025 IEEE 18th Pacific Visualization Conference, PacificVis 2025
PB - IEEE Computer Society
Y2 - 22 April 2025 through 25 April 2025
ER -