ARTS7853
Towards Highly Realistic Artistic Style Transfer via Stable Diffusion with Step-aware and Layer-aware Prompt
Zhanjie Zhang, Quanwei Zhang, Huaizhong Lin, Wei Xing, Juncheng Mo, Shuaicheng Huang, Jinheng Xie, Guangyuan Li, Junsheng Luan, Lei Zhao, Dalong Zhang, Lixia Chen
6 min. talk | August 8th at 15:00 | Session: AI and Arts 3/3
Artistic style transfer aims to transfer the learned artistic style onto an arbitrary content image, generating artistic stylized images. Existing generative adversarial network-based methods fail to generate highly realistic stylized images and always introduce obvious artifacts and disharmonious patterns. Recently, large-scale pre-trained diffusion models have opened up a new way for generating highly realistic artistic stylized images. However, diffusion model-based methods generally fail to preserve the content structure of input content images well, introducing some undesired content structure and style patterns. To address the above problems, we propose a novel pre-trained diffusion-based artistic style transfer method, called LSAST, which can generate highly realistic artistic stylized images while preserving the content structure of input content images well, without introducing obvious artifacts and disharmonious style patterns. Specifically, we introduce a Step-aware and Layer-aware Prompt Space, a set of learnable prompts, which can learn the style information from the collection of artworks and dynamically adjust the input images’ content structure and style patterns. To train our prompt space, we propose a novel inversion method, called Step-aware and Layer-aware Prompt Inversion, which allows the prompt space to learn the style information of the artworks collection. In addition, we inject a pre-trained conditional branch of ControlNet into our LSAST, which further improves our framework’s ability to maintain content structure. Extensive experiments demonstrate that our proposed method can generate more highly realistic artistic stylized images than the state-of-the-art artistic style transfer methods. Code is available at https://github.com/Jamie-Cheung/LSAST.
ARTS7864
MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models
Yixiao Zhang, Yukara Ikemiya, Gus Xia, Naoki Murata, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Yuki Mitsufuji, Simon Dixon
6 min. talk | August 8th at 10:00 | Session: AI and Arts 1/3
Recent advances in text-to-music generation models have opened new avenues in musical creativity. However, editing the music generated by such models remains a significant challenge. This paper introduces a novel approach to editing music generated by these models, enabling the modification of specific attributes, such as genre, mood, and instrument, while keeping other aspects unchanged. Our method transforms text editing into latent space manipulation and adds an additional constraint to enforce consistency. It seamlessly integrates with existing pretrained text-to-music diffusion models without requiring additional training. Experimental results demonstrate superior performance over both zero-shot and certain supervised baselines in style and timbre transfer evaluations. We also show the practical applicability of our approach in real-world music editing scenarios.
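For intuition, a minimal sketch of the embedding-arithmetic idea behind such zero-shot edits is shown below. The `embed` function, the `edit_strength` and `consistency_weight` knobs, and the blending rule are illustrative stand-ins, not the actual MusicMagus method or its parameters.

```python
# Illustrative sketch only: the real system operates on a pretrained
# text-to-music diffusion model; here a seeded random projection stands in
# for its text encoder, and edit_strength / consistency_weight are
# hypothetical knobs, not parameters from the paper.
import numpy as np

rng = np.random.default_rng(0)
_vocab_cache = {}

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in text encoder: one fixed random vector per string."""
    if text not in _vocab_cache:
        _vocab_cache[text] = rng.standard_normal(dim)
    return _vocab_cache[text]

def edit_prompt_embedding(prompt_emb, source_word, target_word,
                          edit_strength=1.0, consistency_weight=0.5):
    """Shift the prompt embedding along the source->target direction,
    then pull it partway back toward the original to preserve the
    unedited musical attributes (a crude consistency constraint)."""
    direction = embed(target_word) - embed(source_word)
    edited = prompt_emb + edit_strength * direction
    return (1 - consistency_weight) * edited + consistency_weight * prompt_emb

original = embed("relaxing classical music with piano")
edited = edit_prompt_embedding(original, "piano", "acoustic guitar")
print(np.linalg.norm(edited - original))
```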
ARTS7868
GladCoder: Stylized QR Code Generation with Grayscale-Aware Denoising Process
Yuqiu Xie, Bolin Jiang, Jiawei Li, Naiqi Li, Bin Chen, Tao Dai, Yuang Peng, Shu-Tao Xia
6 min. talk | August 8th at 15:00 | Session: AI and Arts 3/3
Traditional QR codes consist of a grid of black-and-white square modules, which lack aesthetic appeal and meaning for human perception. This has motivated recent research to beautify the visual appearance of QR codes. However, there is a trade-off between the visual quality and the scanning robustness of the image, so previous works produce simple, low-quality outputs in order to preserve scanning robustness. In this paper, we introduce GladCoder, a novel approach to generating stylized QR codes that are personalized, natural, and text-driven. Its pipeline includes a Depth-guided Aesthetic QR code Generator (DAG) to improve the quality of the image foreground, and a GrayscaLe-Aware Denoising (GLAD) process to enhance scanning robustness. The overall pipeline is based on diffusion models, which allow users to create stylized QR images from a textual prompt that describes the image and a textual input to be encoded. Experiments demonstrate that our method can generate stylized QR codes with appealing perceptual details while maintaining robust scanning reliability in real-world applications.
ARTS7883
Musical Phrase Segmentation via Grammatical Induction
Reed Perkins, Dan Ventura
6 min. talk | August 8th at 10:00 | Session: AI and Arts 1/3
We outline a solution to the challenge of musical phrase segmentation that uses grammatical induction algorithms, a class of algorithms which infer a context-free grammar from an input sequence. We analyze the performance of five grammatical induction algorithms on three datasets using various musical viewpoint combinations. Our experiments show that the LONGESTFIRST algorithm achieves the best F1 scores across all three datasets and that input encodings that include the duration viewpoint result in the best performance.
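For readers unfamiliar with grammatical induction, the following toy sketch replaces repeated subsequences of a melody with fresh nonterminals in a greedy, longest-first spirit. It is a simplified illustration, not the LONGESTFIRST algorithm evaluated in the paper, and the (pitch, duration) viewpoint encoding is an assumption for the example.

```python
# Toy grammar induction in the spirit of a longest-first strategy.
def longest_repeated(seq, min_len=2):
    """Return the longest subsequence occurring at least twice (non-overlapping)."""
    for length in range(len(seq) // 2, min_len - 1, -1):
        seen = {}
        for i in range(len(seq) - length + 1):
            chunk = tuple(seq[i:i + length])
            if chunk in seen and i >= seen[chunk] + length:
                return list(chunk)
            seen.setdefault(chunk, i)
    return None

def induce_grammar(seq):
    """Greedily replace repeated chunks with fresh nonterminal symbols."""
    rules, counter = {}, 0
    seq = list(seq)
    while True:
        chunk = longest_repeated(seq)
        if chunk is None:
            break
        nt = f"A{counter}"; counter += 1
        rules[nt] = chunk
        out, i = [], 0
        while i < len(seq):
            if seq[i:i + len(chunk)] == chunk:
                out.append(nt); i += len(chunk)
            else:
                out.append(seq[i]); i += 1
        seq = out
    return seq, rules

# Symbols are (pitch, duration) "viewpoint" pairs from a short melody.
melody = [("C4", 1), ("E4", 1), ("G4", 2), ("C4", 1), ("E4", 1), ("G4", 2), ("A4", 4)]
print(induce_grammar(melody))
```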
ARTS7905
Re-creation of Creations: A New Paradigm for Lyric-to-Melody Generation
Ang Lv, Xu Tan, Tao Qin, Tie-Yan Liu, Rui Yan
6 min. talk | August 8th at 10:00 | Session: AI and Arts 1/3
Current lyric-to-melody generation methods struggle with the lack of paired lyric-melody training data and with poor adherence to composition guidelines, resulting in melodies that do not sound human-composed. To address these issues, we propose a novel paradigm called Re-creation of Creations (ROC) that combines the strengths of both rule-based and neural-based methods. ROC consists of a two-stage generation-retrieval pipeline: the creation and re-creation stages. In the creation stage, we train a melody language model using melody data to generate high-quality music fragments, which are stored in a database indexed by key features. In the re-creation stage, users provide lyrics and a preferred chord progression, and ROC infers melody features for each lyric sentence. By querying the database, we obtain relevant melody fragments that satisfy composition guidelines, and these candidates are filtered, re-ranked, and concatenated based on the guidelines and the melody language model scores. ROC offers two main advantages: it does not require paired lyric-melody data, and it incorporates commonly used composition guidelines, resulting in music that sounds more human-composed with better controllability. Both objective and subjective evaluation results on English and Chinese lyrics show the effectiveness of ROC.
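A minimal sketch of what the re-creation (retrieval) stage could look like is given below. The (chord, length) index, the word-count stand-in for syllables, and the scoring function are hypothetical simplifications, not ROC's actual key features or melody language model.

```python
# Toy sketch of a fragment database indexed by key features, queried per
# lyric sentence and re-ranked by a stand-in score.
from collections import defaultdict

fragment_db = defaultdict(list)

def add_fragment(notes, chord):
    """Index a generated melody fragment by its chord and note count."""
    fragment_db[(chord, len(notes))].append(notes)

def retrieve(lyric_sentence, chord, score_fn):
    """Fetch fragments matching the lyric's length and the requested chord,
    then re-rank the candidates with a melody-language-model-style score."""
    n_units = len(lyric_sentence.split())   # crude stand-in for syllable count
    candidates = fragment_db.get((chord, n_units), [])
    return sorted(candidates, key=score_fn, reverse=True)

add_fragment(["C4", "D4", "E4", "G4"], "C")
add_fragment(["E4", "G4", "C5", "E5"], "C")
print(retrieve("shine on my heart", "C", score_fn=lambda notes: len(set(notes))))
```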
ARTS7916
Intertwining CP and NLP: The Generation of Unreasonably Constrained Sentences
Alexandre Bonlarron, Jean-Charles Régin
6 min. talk | August 8th at 11:30 | Session: AI and Arts 2/3
Constrained text generation remains a challenging task, particularly when dealing with hard constraints. Traditional Natural Language Processing (NLP) approaches prioritize generating meaningful and coherent output, but current state-of-the-art methods often lack the expressiveness and constraint-satisfaction capabilities to handle such tasks effectively. This paper presents the Constraints First Framework to remedy this issue. The framework treats a constrained text generation problem as a discrete combinatorial optimization problem, solved by a constraint programming method that combines linguistic properties (e.g., n-grams or language level) with other more classical constraints (e.g., the number of characters, syllables, or words). Finally, a curation phase selects the best-generated sentences according to perplexity computed with a large language model. The effectiveness of this approach is demonstrated by tackling a new, particularly demanding constrained text generation problem: the iconic RADNER sentences problem, which aims to generate sentences respecting a set of strict rules defined by their use in vision and clinical research. Thanks to our CP-based approach, many new strongly constrained sentences have been successfully generated automatically. This highlights the potential of our approach to handle unreasonably constrained text generation scenarios.
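The following toy example illustrates the constraints-first idea on a tiny scale: enumerate candidate sentences that satisfy hard constraints (word count, character budget, allowed bigrams) and then curate them with a score. The vocabulary, constraint values, and `toy_score` are invented for illustration; the paper instead uses a real constraint programming solver and LLM-based perplexity.

```python
# Toy "constraints first, curation second" pipeline over a tiny vocabulary.
from itertools import product

vocab = ["the", "small", "cat", "dog", "sees", "a", "red", "ball"]
allowed_bigrams = {("the", "small"), ("the", "cat"), ("small", "cat"),
                   ("small", "dog"), ("cat", "sees"), ("dog", "sees"),
                   ("sees", "a"), ("a", "red"), ("red", "ball")}

def satisfies(words, n_words=5, max_chars=22):
    """Hard constraints: exact word count, character budget, bigram membership."""
    if len(words) != n_words:
        return False
    if len(" ".join(words)) > max_chars:
        return False
    return all((a, b) in allowed_bigrams for a, b in zip(words, words[1:]))

def toy_score(words):
    """Stand-in for LLM perplexity: prefer varied, compact sentences."""
    return len(set(words)) / len(" ".join(words))

candidates = [w for w in product(vocab, repeat=5) if satisfies(list(w))]
print(sorted(candidates, key=toy_score, reverse=True)[:3])
```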
ARTS7922
Perception-Inspired Graph Convolution for Music Understanding Tasks
Emmanouil Karystinaios, Francesco Foscarin, Gerhard Widmer
6 min. talk | August 8th at 10:00 | Session: AI and Arts 1/3
We propose a new graph convolutional block, called MusGConv, specifically designed for the efficient processing of musical score data and motivated by general perceptual principles. It focuses on two fundamental dimensions of music, pitch and rhythm, and considers both relative and absolute representations of these components. We evaluate our approach on four different music understanding problems: monophonic voice separation, harmonic analysis, cadence detection, and composer identification, which, in abstract terms, translate to different graph learning problems, namely node classification, link prediction, and graph classification. Our experiments demonstrate that MusGConv improves the performance on three of the aforementioned tasks while being conceptually very simple and efficient. We interpret this as evidence that it is beneficial to include perception-informed processing of fundamental musical concepts when developing graph network applications on musical score data. All code and models are released at https://github.com/manoskary/musgconv.
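As a rough illustration of mixing absolute node features with relative (difference-based) edge messages, consider the sketch below. It is a simplified re-implementation in plain PyTorch, not the released MusGConv code (linked above), and the feature layout and layer sizes are assumptions.

```python
# Sketch of a graph convolution combining absolute node features with
# relative edge features, in the spirit of the block described above.
import torch
import torch.nn as nn

class RelativeGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.msg = nn.Linear(2 * in_dim, out_dim)   # [neighbor absolute, relative]
        self.upd = nn.Linear(in_dim + out_dim, out_dim)

    def forward(self, x, edge_index):
        # x: (num_nodes, in_dim) absolute features (e.g., pitch, onset, duration)
        # edge_index: (2, num_edges) with rows (src, dst)
        src, dst = edge_index
        rel = x[src] - x[dst]                       # relative intervals / offsets
        messages = torch.relu(self.msg(torch.cat([x[src], rel], dim=-1)))
        agg = torch.zeros(x.size(0), messages.size(-1), device=x.device)
        agg.index_add_(0, dst, messages)            # sum aggregation per target node
        return torch.relu(self.upd(torch.cat([x, agg], dim=-1)))

# Tiny random score graph: 4 notes, edges between consecutive notes.
x = torch.randn(4, 3)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
out = RelativeGraphConv(3, 8)(x, edge_index)
print(out.shape)  # torch.Size([4, 8])
```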
ARTS7968
Diffutoon: High-Resolution Editable Toon Shading via Diffusion Models
Zhongjie Duan, Chengyu Wang, Cen Chen, Weining Qian, Jun Huang
6 min. talk | August 8th at 15:00 | Session: AI and Arts 3/3
Toon shading is a type of non-photorealistic rendering task in animation. Its primary purpose is to render objects with a flat and stylized appearance. As diffusion models have ascended to the forefront of image synthesis, this paper delves into an innovative form of toon shading based on diffusion models, aiming to directly render photorealistic videos into anime styles. In video stylization, existing methods encounter persistent challenges, notably in maintaining consistency and achieving high visual quality. In this paper, we model the toon shading problem as four subproblems, i.e., stylization, consistency enhancement, structure guidance, and colorization. To address the challenges in video stylization, we propose an effective toon shading approach called Diffutoon. Diffutoon is capable of rendering remarkably detailed, high-resolution, and extended-duration videos in anime style. It can also edit the video content according to input prompts via an additional branch. The efficacy of Diffutoon is evaluated through quantitative metrics and human evaluation. Notably, Diffutoon surpasses both open-source and closed-source baseline approaches in our experiments. Our work is accompanied by the release of both the source code and example videos on GitHub.
ARTS7969
Re:Draw – Context Aware Translation as a Controllable Method for Artistic Production
João Libório Cardoso, Francesco Banterle, Paolo Cignoni, Michael Wimmer
6 min. talk | August 8th at 15:00 | Session: AI and Arts 3/3
We introduce context-aware translation, a novel method that combines the benefits of inpainting and image-to-image translation, simultaneously respecting the original input and contextual relevance, where existing methods fall short. By doing so, our method opens new avenues for the controllable use of AI within artistic creation, from animation to digital art. As a use case, we apply our method to redraw the eyes of any hand-drawn animated character according to arbitrary design specifications. Eyes serve as a focal point that captures viewer attention and conveys a range of emotions; however, the labor-intensive nature of traditional animation often leads to compromises in the complexity and consistency of eye design. Furthermore, we remove the need for production data for training and introduce a new character recognition method that surpasses existing work by not requiring fine-tuning to specific productions. This proposed use case could help maintain consistency throughout production and unlock bolder and more detailed design choices without the production cost drawbacks. A user study shows that context-aware translation is preferred over existing work 95.16% of the time.
ARTS8097
KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph
Yanbei Jiang, Krista A. Ehinger, Jey Han Lau
6 min. talk | August 8th at 15:00 | Session: AI and Arts 3/3
Exploring the narratives conveyed by fine-art paintings is a challenge in image captioning, where the goal is to generate descriptions that not only precisely represent the visual content but also offer an in-depth interpretation of the artwork’s meaning. The task is particularly complex for artwork images due to their diverse interpretations and varied aesthetic principles across different artistic schools and styles. In response to this, we present KALE (Knowledge-Augmented vision-Language model for artwork Elaborations), a novel approach that enhances existing vision-language models by integrating artwork metadata as additional knowledge. KALE incorporates the metadata in two ways: firstly as direct textual input, and secondly through a multimodal heterogeneous knowledge graph. To optimize the learning of graph representations, we introduce a new cross-modal alignment loss that maximizes the similarity between the image and its corresponding metadata. Experimental results demonstrate that KALE achieves strong performance (when evaluated with CIDEr, in particular) over existing state-of-the-art work across several artwork datasets. The source code of the project is available at https://github.com/Yanbei-Jiang/Artwork-Interpretation.
ARTS8213
GEM: Generating Engaging Multimodal Content
Chongyang Gao, Yiren Jian, Natalia Denisenko, Soroush Vosoughi, V. S. Subrahmanian
6 min. talk | August 8th at 11:30 | Session: AI and Arts 2/3
Generating engaging multimodal content is a key objective in numerous applications, such as the creation of online advertisements that captivate user attention through a synergy of images and text. In this paper, we introduce GEM, a novel framework engineered for the generation of engaging multimodal image-text posts. The GEM framework operates in two primary phases. Initially, GEM integrates a pre-trained engagement discriminator with a technique for deriving an effective continuous prompt tailored for the stable diffusion model. Subsequently, GEM unveils an iterative algorithm dedicated to producing coherent and compelling image-sentence pairs centered around a specified topic of interest. Through a combination of experimental analysis and human evaluations, we establish that the image-sentence pairs generated by GEM surpass several established baselines not only in engagement but also in alignment.
ARTS8225
Manipulating Embeddings of Stable Diffusion Prompts
Niklas Deckers, Julia Peters, Martin Potthast
6 min. talk | August 8th at 11:30 | Session: AI and Arts 2/3
Prompt engineering is still the primary way for users of generative text-to-image models to manipulate generated images in a targeted way. By treating the model as a continuous function and passing gradients between the image space and the prompt embedding space, we propose and analyze a new method to directly manipulate the embedding of a prompt instead of the prompt text. We then derive three practical interaction tools to support users with image generation: (1) Optimization of a metric defined in the image space that measures, for example, the image style. (2) Supporting a user in creative tasks by allowing them to navigate in the image space along a selection of directions of "near" prompt embeddings. (3) Changing the embedding of the prompt to include information that a user has seen in a particular seed but has difficulty describing in the prompt. Compared to prompt engineering, user-driven prompt embedding manipulation enables a more fine-grained, targeted control that integrates a user’s intentions. Our user study shows that our methods are considered less tedious and that the resulting images are often preferred.
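Conceptually, interaction tool (1) can be sketched as gradient ascent on an image-space metric through the generator. In the sketch below, a tiny random network stands in for the text-to-image model and `brightness` stands in for a real style metric, so it only mirrors the idea under those assumptions, not the paper's actual setup.

```python
# Sketch of optimizing a prompt embedding against an image-space metric by
# backpropagating through a (stand-in) generator.
import torch
import torch.nn as nn

torch.manual_seed(0)
generator = nn.Sequential(nn.Linear(16, 64), nn.Tanh(),
                          nn.Linear(64, 3 * 8 * 8), nn.Sigmoid())  # stand-in model

def brightness(image_flat):
    """Toy image-space metric: mean pixel value."""
    return image_flat.mean()

prompt_embedding = nn.Parameter(torch.randn(16))   # the embedding to manipulate
optimizer = torch.optim.Adam([prompt_embedding], lr=0.05)

for step in range(100):
    image = generator(prompt_embedding)
    loss = -brightness(image)                      # maximize the chosen metric
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(float(brightness(generator(prompt_embedding))))
```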
ARTS8232
DP-Font: Chinese Calligraphy Font Generation Using Diffusion Model and Physical Information Neural Network
Liguo Zhang, Yalong Zhu, Achref Benarab, Yusen Ma, Yuxin Dong, Jianguo Sun
6 min. talk | August 8th at 15:00 | Session: AI and Arts 3/3
As a typical visual art form, Chinese calligraphy has a long history and aesthetic value. However, current methods for generating Chinese fonts still struggle with complex character shapes and lack personalized writing styles. To address these issues, we propose DP-Font, a font generation method for Chinese calligraphy based on a diffusion model that incorporates a physical information neural network (PINN). First, multi-attribute guidance steers the generation process of the diffusion model and introduces the critical constraint of stroke order in Chinese characters, aiming to significantly improve the quality of the generated results. We then incorporate physical constraints into the neural network loss function, utilizing physical equations to provide in-depth guidance and constraints on the learning process. By learning the movement rules of the nib and the diffusion patterns of the ink, DP-Font can generate personalized calligraphy styles. The generated fonts are very close to the calligraphers’ works. Compared with existing deep learning-based techniques, DP-Font makes significant progress in enhancing the physical plausibility of the model, generating more realistic and high-quality results.
ARTS8245
MuChin: A Chinese Colloquial Description Benchmark for Evaluating Language Models in the Field of Music
Zihao Wang, Shuyu Li, Tao Zhang, Qi Wang, Pengfei Yu, Jinyang Luo, Yan Liu, Ming Xi, Kejun Zhang
6 min. talk | August 8th at 11:30 | Session: AI and Arts 2/3
The rapidly evolving multimodal Large Language Models (LLMs) urgently require new benchmarks to uniformly evaluate their performance on understanding and textually describing music. However, due to semantic gaps between Music Information Retrieval (MIR) algorithms and human understanding, discrepancies between professionals and the public, and low precision of annotations, existing music description datasets cannot serve as benchmarks. To this end, we present MuChin, the first open-source music description benchmark in Chinese colloquial language, designed to evaluate the performance of multimodal LLMs in understanding and describing music. We established the Caichong Music Annotation Platform (CaiMAP) that employs an innovative multi-person, multi-stage assurance method, and recruited both amateurs and professionals to ensure the precision of annotations and alignment with popular semantics. Utilizing this method, we built a large-scale, private dataset with multi-dimensional, high-precision music annotations, the Caichong Music Dataset (CaiMD), and carefully selected 1,000 high-quality entries to serve as the test set for MuChin. Based on MuChin, we analyzed the discrepancies between professionals and amateurs in terms of music description, and empirically demonstrated the effectiveness of CaiMD for fine-tuning LLMs. Ultimately, we employed MuChin to evaluate existing music understanding models on their ability to provide colloquial descriptions of music.
ARTS8246
Paintings and Drawings Aesthetics Assessment with Rich Attributes for Various Artistic Categories
Xin Jin, Qianqian Qiao, Yi Lu, Huaye Wang, Shan Gao, Heng Huang, Guangdong Li
6 min. talk | August 8th at 15:00 | Session: AI and Arts 3/3
Image aesthetic evaluation is a highly prominent research domain in the field of computer vision. In recent years, there has been a proliferation of datasets and corresponding evaluation methodologies for assessing the aesthetic quality of photographic works, leading to the establishment of a relatively mature research environment. However, in contrast to the extensive research in photographic aesthetics, the field of aesthetic evaluation for paintings and drawings has seen limited attention until the introduction of the BAID dataset in March 2023. This dataset solely comprises overall scores for high-quality artistic images. Our research marks the pioneering introduction of a multi-attribute, multi-category dataset specifically tailored to the field of painting: Aesthetics of Paintings and Drawings Dataset (APDD). The construction of APDD received active participation from 28 professional artists worldwide, along with dozens of students specializing in the field of art. This dataset encompasses 24 distinct artistic categories and 10 different aesthetic attributes. Each image in APDD has been evaluated by six professionally trained experts in the field of art, including assessments for both total aesthetic scores and aesthetic attribute scores. The final APDD dataset comprises a total of 4985 images, with an annotation count exceeding 31100 entries. Concurrently, we propose an innovative approach: Art Assessment Network for Specific Painting Styles (AANSPS), designed for the assessment of aesthetic attributes in mixed-attribute art datasets. Through this research, our goal is to catalyze advancements in the field of aesthetic evaluation for paintings and drawings, while enriching the available resources and methodologies for its further development and application. Dataset is available at https://github.com/BestiVictory/APDD.git
ARTS8252
End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding
Wei Zeng, Xian He, Ye Wang
6 min. talk | August 8th at 10:00 | Session: AI and Arts 1/3
Piano audio-to-score transcription (A2S) is an important yet underexplored task with extensive applications for music composition, practice, and analysis. However, existing end-to-end piano A2S systems face difficulties in retrieving bar-level information such as key and time signatures, and have been trained and evaluated with only synthetic data. To address these limitations, we propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that aligns with the hierarchical structure of musical scores, enabling the transcription of score information at both the bar and note levels through multi-task learning. To bridge the gap between synthetic data and recordings of human performance, we propose a two-stage training scheme, which involves pre-training the model using an expressive performance rendering (EPR) system on synthetic audio, followed by fine-tuning the model using recordings of human performance. To preserve the voicing structure for score reconstruction, we propose a pre-processing method for **Kern scores in scenarios with an unconstrained number of voices. Experimental results support the effectiveness of our proposed approaches, both in transcription performance on synthetic audio data compared to the current state of the art and in the first experiment on recordings of human performance.
ARTS8255
Integrating View Conditions for Image Synthesis
Jinbin Bai, Zhen Dong, Aosong Feng, Xiao Zhang, Tian Ye, Kaicheng Zhou
6 min. talk | August 8th at 15:00 | Session: AI and Arts 3/3
In the field of image processing, applying intricate semantic modifications within existing images remains an enduring challenge. This paper introduces a pioneering framework that integrates viewpoint information to enhance the control of image editing tasks, especially for interior design scenes. By surveying existing object editing methodologies, we distill three essential criteria — consistency, controllability, and harmony — that should be met for an image editing method. In contrast to previous approaches, our framework takes the lead in satisfying all three requirements for addressing the challenge of image synthesis. Through comprehensive experiments, encompassing both quantitative assessments and qualitative comparisons with contemporary state-of-the-art methods, we present compelling evidence of our framework’s superior performance across multiple dimensions. This work establishes a promising avenue for advancing image synthesis techniques and empowering precise object modifications while preserving the visual coherence of the entire composition.
ARTS8260
Large Language Models for Human-AI Co-Creation of Robotic Dance Performances
Allegra De Filippo, Michela Milano
6 min. talk | August 8th at 11:30 | Session: AI and Arts 2/3
This paper focuses on the potential of Generative Artificial Intelligence (AI), particularly Large Language Models (LLMs), in the still unexplored domain of robotic dance creation. In particular, we assess whether an LLM (GPT-3.5 Turbo) can create robotic dance choreographies, and we investigate if the feedback provided by human creators can improve the quality of the output. To this end, we design three prompt engineering techniques for robotic dance creation. In the prompts, we gradually introduce human knowledge through examples and feedback in natural language in order to explore the dynamics of human-AI co-creation. The experimental analysis shows that the capabilities of the LLM can be improved through human collaboration, producing choreographies with a greater artistic impact on the evaluation audience. The findings offer valuable insights into the interplay between human creativity and AI generative models, paving the way for enhanced collaborative frameworks in creative domains.
ARTS8263
Inferring Iterated Function Systems Approximately from Fractal Images
Haotian Liu, Dixin Luo, Hongteng Xu
6 min. talk | August 8th at 15:00 | Session: AI and Arts 3/3
As an important mathematical concept, fractals commonly appear in nature and inspire the design of many artistic works. Although we can generate various fractal images easily based on different iterated function systems (IFSs), inferring an IFS from a given fractal image is still a challenging inverse problem for both scientific research and artistic design. In this study, we explore the potential of deep learning techniques for this problem, learning a multi-head auto-encoding model to infer typical IFSs (including Julia set and L-system) from fractal images. In principle, the proposed model encodes fractal images in a latent space and decodes their corresponding IFSs based on the latent representations. For the fractal images generated by heterogeneous IFSs, we let them share the same encoder and apply two decoders to infer the sequential and non-sequential parameters of their IFSs, respectively. By introducing one more decoder to reconstruct fractal images, we can leverage large-scale unlabeled fractal images to learn the model in a semi-supervised way, which suppresses the risk of over-fitting. Comprehensive experiments demonstrate that our method provides a promising solution to infer IFSs approximately from fractal images. Code and supplementary file are available at https://github.com/HaotianLiu123/Inferring-IFSs-From-Fractal-Images.
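A possible shape of such a multi-head auto-encoding model is sketched below: a shared image encoder, a non-sequential parameter head, a sequential token head, and a reconstruction head that enables semi-supervised training on unlabeled images. All layer sizes, heads, and output formats are assumptions for illustration, not the authors' architecture.

```python
# Architectural sketch of a shared encoder with multiple decoder heads.
import torch
import torch.nn as nn

class MultiHeadIFSModel(nn.Module):
    def __init__(self, latent=128, vocab=32, seq_len=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(latent))
        self.param_head = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(),
                                        nn.Linear(64, 2))   # e.g., Julia-set c = a + bi
        self.seq_head = nn.GRU(latent, 64, batch_first=True)
        self.seq_out = nn.Linear(64, vocab)                  # e.g., L-system rule tokens
        self.recon_head = nn.Sequential(nn.Linear(latent, 64 * 64), nn.Sigmoid())
        self.seq_len = seq_len

    def forward(self, images):
        z = self.encoder(images)                             # shared latent code
        params = self.param_head(z)                          # non-sequential parameters
        steps = z.unsqueeze(1).repeat(1, self.seq_len, 1)
        seq_logits = self.seq_out(self.seq_head(steps)[0])   # sequential parameters
        recon = self.recon_head(z).view(-1, 1, 64, 64)       # image reconstruction
        return params, seq_logits, recon

model = MultiHeadIFSModel()
params, seq_logits, recon = model(torch.rand(2, 1, 64, 64))
print(params.shape, seq_logits.shape, recon.shape)
```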
ARTS8267
Expressing Musical Ideas with Constraint Programming Using a Model of Tonal Harmony
Damien Sprockeels, Peter Van Roy
6 min. talk | August 8th at 10:00 | Session: AI and Arts 1/3
The realm of music composition with artificial intelligence stands as a pertinent and evolving field, attracting increasing interest and exploration in contemporary research and practice. This paper presents a constraint-programming-based approach to generating four-voice diatonic chord progressions according to established rules of tonal harmony. It uses the strength of constraint programming as a formal logic to rigorously model musical rules and to offer complete control over the set of rules that are enforced. This allows composers to interact iteratively with the model, adding and removing constraints to shape the solutions according to their preferences. We define a constraint model of basic tonal harmony, called Diatony. We show that our implementation using the Gecode solver finds optimal solutions in reasonable time, and we show how it can be used by a composer to aid in their composition process.
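To illustrate the rules-as-constraints idea on a much smaller scale, the following brute-force Python sketch enumerates four-voice C-major voicings under a few textbook constraints (voice ranges, no voice crossing, spacing limits, complete triad). Diatony itself models full diatonic progressions as a constraint program solved with Gecode; this toy does not reproduce that model, and the specific ranges and limits are assumptions.

```python
# Toy exhaustive search for four-voice voicings of a C-major triad.
from itertools import product

C_MAJOR = {0, 4, 7}                      # pitch classes C, E, G
RANGES = {                               # approximate MIDI ranges per voice
    "bass": range(40, 61), "tenor": range(48, 68),
    "alto": range(55, 75), "soprano": range(60, 82),
}

def is_valid(b, t, a, s):
    if not all(p % 12 in C_MAJOR for p in (b, t, a, s)):
        return False                                   # only chord tones
    if not (b < t < a < s):
        return False                                   # no voice crossing
    if t - b > 19 or a - t > 12 or s - a > 12:
        return False                                   # spacing limits
    return {p % 12 for p in (b, t, a, s)} == C_MAJOR   # complete triad

solutions = [v for v in product(RANGES["bass"], RANGES["tenor"],
                                RANGES["alto"], RANGES["soprano"])
             if is_valid(*v)]
print(len(solutions), solutions[0])
```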
ARTS8280
Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls
Liwei Lin, Gus Xia, Yixiao Zhang, Junyan Jiang
6 min. talk | August 8th at 10:00 | Session: AI and Arts 1/3
Controllable music generation plays a vital role in human-AI music co-creation. While Large Language Models (LLMs) have shown promise in generating high-quality music, their focus on autoregressive generation limits their utility in music editing tasks. To address this gap, we propose a novel approach leveraging a parameter-efficient heterogeneous adapter combined with a masking training scheme. This approach enables autoregressive language models to seamlessly address music inpainting tasks. Additionally, our method integrates frame-level content-based controls, facilitating track-conditioned music refinement and score-conditioned music arrangement. We apply this method to fine-tune MusicGen, a leading autoregressive music generation model. Our experiments demonstrate promising results across multiple music editing tasks, offering more flexible controls for future AI-driven music editing tools. The source code and a demo page showcasing our work are available at https://kikyo-16.github.io/AIR.
ARTS8284
Retrieval Guided Music Captioning via Multimodal Prefixes
Nikita Srivatsan, Ke Chen, Shlomo Dubnov, Taylor Berg-Kirkpatrick
6 min. talk | August 8th at 11:30 | Session: AI and Arts 2/3
In this paper we put forward a new approach to music captioning, the task of automatically generating natural language descriptions for songs. These descriptions are useful both for categorization and analysis, and also from an accessibility standpoint, as they form an important component of closed captions for video content. Our method supplements an audio encoding with a retriever, allowing the decoder to condition on a multimodal signal from both the audio of the song itself and a candidate caption identified by a nearest-neighbor system. This lets us retain the advantages of a retrieval-based approach while also allowing for the flexibility of a generative one. We evaluate this system on a dataset of 200k music-caption pairs scraped from Audiostock, a royalty-free music platform, and on MusicCaps, a dataset of 5.5k pairs. We demonstrate significant improvements over prior systems across both automatic metrics and human evaluation.
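The prefix construction can be pictured as in the sketch below: project the audio encoding into the decoder's embedding space and prepend it, together with the embedded tokens of a retrieved nearest-neighbor caption, ahead of the target caption tokens. The dimensions, module names, and the omission of the actual decoder are illustrative assumptions, not the paper's implementation.

```python
# Sketch of assembling a multimodal prefix for a caption decoder.
import torch
import torch.nn as nn

d_model, vocab = 256, 1000
audio_proj = nn.Linear(512, d_model)     # maps audio features to prefix embeddings
tok_embed = nn.Embedding(vocab, d_model)

def build_prefix(audio_feats, retrieved_ids, target_ids):
    """audio_feats: (T_a, 512); retrieved_ids / target_ids: 1D Long tensors."""
    audio_prefix = audio_proj(audio_feats)       # (T_a, d_model)
    retrieved = tok_embed(retrieved_ids)         # (T_r, d_model) retrieved caption
    target = tok_embed(target_ids)               # (T_t, d_model) target caption
    return torch.cat([audio_prefix, retrieved, target], dim=0)

x = build_prefix(torch.randn(8, 512),
                 torch.randint(0, vocab, (12,)),
                 torch.randint(0, vocab, (10,)))
print(x.shape)  # torch.Size([30, 256])
```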
ARTS8286
From Pixels to Metal: AI-Empowered Numismatic Art
Penousal Machado, Tiago Martins, João Correia, Luís Espírito Santo, Nuno Lourenço, João Cunha, Sérgio Rebelo, Pedro Martins, João Bicker
6 min. talk | August 8th at 11:30 | Session: AI and Arts 2/3
This paper describes our response to a unique challenge presented by the Portuguese National Press-Mint: to use Artificial Intelligence to design a commemorative coin that celebrates the "digital world". We explain the process of this coin’s co-creation, from conceptualisation to production, highlighting the design process, the underlying rationale, key obstacles encountered, and the technical innovations and developments made to meet the challenge. These include the development of an evolutionary art system guided by Contrastive Language–Image Pre-training (CLIP) and Machine Learning-based aesthetic models, a system for prompt evolution, and a representation for encoding genotypes in mintable format. This collaboration produced a limited edition 10 euro silver proof coin, with a total of 4000 units minted by the National Press-Mint. The coin was met with enthusiasm, selling out within two months. This work contributes to Computational Creativity, particularly co-creativity, co-design, and digital art, and represents a significant step in using Artificial Intelligence for Numismatics.
ARTS8287
FastSAG: Towards Fast Non-Autoregressive Singing Accompaniment Generation
Jianyi Chen, Wei Xue, Xu Tan, Zhen Ye, Qifeng Liu, Yike Guo
6 min. talk | August 8th at 10:00 | Session: AI and Arts 1/3
Singing Accompaniment Generation (SAG), which generates instrumental music to accompany input vocals, is crucial to developing human-AI symbiotic art creation systems. The state-of-the-art method, SingSong, utilizes a multi-stage autoregressive (AR) model for SAG; however, this method is extremely slow because it generates semantic and acoustic tokens recursively, making it unsuitable for real-time applications. In this paper, we aim to develop a Fast SAG method that can create high-quality and coherent accompaniments. A non-AR diffusion-based framework is developed, which, by carefully designing the conditions inferred from the vocal signals, generates the Mel spectrogram of the target accompaniment directly. With diffusion and Mel spectrogram modeling, the proposed method significantly simplifies the AR token-based SingSong framework and largely accelerates the generation. We also design semantic projection and prior projection blocks, as well as a set of loss functions, to ensure that the generated accompaniment has semantic and rhythmic coherence with the vocal signal. Through intensive experimental studies, we demonstrate that the proposed method generates better samples than SingSong and accelerates the generation by at least 30 times. Audio samples and code are available at this link.
ARTS8290
Disrupting Diffusion-based Inpainters with Semantic Digression
Geonho Son, Juhun Lee, Simon S. Woo
6 min. talk | August 8th at 15:00 | Session: AI and Arts 3/3
The fabrication of visual misinformation on the web and social media has increased exponentially with the advent of foundational text-to-image diffusion models. In particular, Stable Diffusion inpainters allow the synthesis of maliciously inpainted images of personal and private figures and of copyrighted content, also known as deepfakes. To combat such generations, a disruption framework named Photoguard has been proposed, which adds adversarial noise to the context image to disrupt its inpainting synthesis. While this framework suggested a diffusion-friendly approach, the disruption is not sufficiently strong and it requires a significant amount of GPU memory and time to immunize the context image. In our work, we re-examine both the minimal and favorable conditions for a successful inpainting disruption, proposing DDD, a “Digression guided Diffusion Disruption” framework. First, we identify the most adversarially vulnerable diffusion timestep range with respect to the hidden space. Within this scope of noised manifold, we pose the problem as a semantic digression optimization. We maximize the distance between the inpainting instance’s hidden states and a semantic-aware hidden state centroid, calibrated both by Monte Carlo sampling of hidden states and a discretely projected optimization in the token space. Effectively, our approach achieves stronger disruption and a higher success rate than Photoguard while lowering the GPU memory requirement and making the optimization up to three times faster.
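A bare-bones PGD-style sketch of the digression idea is shown below: perturb the context image so that a feature extractor's hidden state moves away from a centroid, within an L-infinity budget. The small random CNN, the centroid choice, and the step sizes are stand-ins, and the paper's timestep selection, Monte Carlo centroid calibration, and token-space projection are omitted.

```python
# Sketch of pushing an image's hidden state away from a centroid under an
# L-infinity perturbation budget (projected gradient ascent).
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
                        nn.Flatten(), nn.LazyLinear(32))   # stand-in feature extractor

image = torch.rand(1, 3, 32, 32)
centroid = encoder(image).detach()                          # stand-in semantic centroid
epsilon, alpha, steps = 8 / 255, 2 / 255, 20
delta = torch.empty_like(image).uniform_(-epsilon, epsilon).requires_grad_(True)

for _ in range(steps):
    hidden = encoder((image + delta).clamp(0, 1))
    loss = (hidden - centroid).pow(2).sum()    # digression: push away from centroid
    loss.backward()
    with torch.no_grad():
        delta += alpha * delta.grad.sign()     # gradient ascent step
        delta.clamp_(-epsilon, epsilon)        # project back into the budget
    delta.grad.zero_()

protected = (image + delta).clamp(0, 1).detach()
print(float((encoder(protected) - centroid).pow(2).sum()))
```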
ARTS8291
A Conflict-Embedded Narrative Generation Using Commonsense Reasoning
Youngrok Song, Gunhee Cho, HyunJu Kim, Youngjune Kim, Byung-Chull Bae, Yun-Gyung Cheong
6 min. talk | August 8th at 11:30 | Session: AI and Arts 2/3
Conflict is a critical element in the narrative, inciting dramatic tension. This paper introduces CNGCI (Conflict-driven Narrative Generation through Commonsense Inference), a neuro-symbolic framework designed to generate coherent stories embedded with conflict using commonsense inference. Our framework defines narrative conflict by leveraging the concept of a soft causal threat, where conflict serves as an obstacle that reduces the likelihood of achieving the protagonist’s goal by weakening the causal link between context and goal through defeasible inference. Comparative studies against multiple story generation baselines utilizing commonsense reasoning show that our framework outperforms the baselines in creating narratives that distinctly embody conflict while maintaining coherency.