68
Minimizing Weighted Counterfactual Regret with Optimistic Online Mirror Descent
Hang Xu, Kai Li, Bingyun Liu, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng
12 min. talk | August 6th at 15:00 | Session: ML: Machine Learning (2/6)
Counterfactual regret minimization (CFR) is a family of algorithms for effectively solving imperfect-information games. It decomposes the total regret into counterfactual regrets and minimizes them with local regret minimization algorithms such as Regret Matching (RM) or RM+. Recent research establishes a connection between Online Mirror Descent (OMD) and RM+, paving the way for the optimistic variant PRM+ and its extension PCFR+. However, PCFR+ assigns uniform weights to each iteration when determining regrets, leading to substantial regrets when facing dominated actions. This work explores minimizing weighted counterfactual regret with optimistic OMD, resulting in a novel CFR variant, PDCFR+. It integrates PCFR+ and Discounted CFR (DCFR) in a principled manner, swiftly mitigating the negative effects of dominated actions and consistently leveraging predictions to accelerate convergence. Theoretical analyses prove that PDCFR+ converges to a Nash equilibrium, in particular under distinct weighting schemes for regrets and average strategies. Experimental results demonstrate PDCFR+’s fast convergence in common imperfect-information games. The code is available at https://github.com/rpSebastian/PDCFRPlus.
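As a rough illustration of the ingredients PDCFR+ combines, the sketch below pairs a PRM+-style predictive update with DCFR-style discounting at a single decision point. Using the last instantaneous regret as the prediction and the discount exponent `alpha` are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def predictive_discounted_rm_step(Q, last_regret, utilities, t, alpha=1.5):
    # Predictive step: bias the strategy with the last observed regret
    # (the usual PRM+ prediction; the paper's exact choice may differ).
    hinted = np.maximum(Q + last_regret, 0.0)
    s = hinted.sum()
    strategy = hinted / s if s > 0 else np.full(len(Q), 1.0 / len(Q))
    # Instantaneous regret of each action against the realized value.
    regret = utilities - strategy @ utilities
    Q = np.maximum(Q + regret, 0.0)
    # DCFR-style discount: down-weighting earlier iterations quickly
    # washes out regret accumulated on dominated actions.
    Q *= t**alpha / (t**alpha + 1.0)
    return Q, regret, strategy
```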
List of keywords
Machine Learning -> ML: Game Theory
Game Theory and Economic Paradigms -> GTEP: Noncooperative games
70
Structure-Preserving Physics-Informed Neural Networks with Energy or Lyapunov Structure
Haoyu Chu, Yuto Miyatake, Wenjun Cui, Shikui Wei, Daisuke Furihata
6 min. talk | August 7th at 11:30 | Session: ML: Deep learning architectures
Recently, there has been growing interest in using physics-informed neural networks (PINNs) to solve differential equations. However, the preservation of structure, such as energy and stability, in a suitable manner has yet to be established. This limitation could be a potential reason why the learning process for PINNs is not always efficient and the numerical results may suggest nonphysical behavior. Moreover, there has been little research on their application to downstream tasks. To address these issues, we propose structure-preserving PINNs to improve their performance and broaden their applications to downstream tasks. Firstly, by leveraging prior knowledge about the physical system, a structure-preserving loss function is designed to assist the PINN in learning the underlying structure. Secondly, a framework that utilizes structure-preserving PINNs for robust image recognition is proposed. Here, preserving the Lyapunov structure of the underlying system ensures its stability. Experimental results demonstrate that the proposed method improves the numerical accuracy of PINNs for partial differential equations (PDEs). Furthermore, the robustness of the model against adversarial perturbations in image data is enhanced.
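For concreteness, a structure-preserving penalty of the kind described might look like the following for a conservative system; `energy_fn` and the penalty form are hypothetical illustrations, not the paper's loss.

```python
import torch

def energy_preserving_loss(u_pred, t, energy_fn):
    # For a conservative system, the energy along the predicted trajectory
    # should stay constant, so we penalize deviation from its initial value.
    E = energy_fn(u_pred, t)          # energy at each time step, shape (T,)
    return ((E - E[0]) ** 2).mean()
```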
List of keywords
Machine Learning -> ML: Deep learning architectures
Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
Computer Vision -> CV: Machine learning for vision
Machine Learning -> ML: Supervised Learning
97
Automatic De-Biased Temporal-Relational Modeling for Stock Investment Recommendation
Weijun Chen, Shun Li, Xipu Yu, Heyuan Wang, Wei Chen, Tengjiao Wang
6 min. talk | August 7th at 11:30 | Session: DM: Applications
Stock investment recommendation is crucial for guiding investment decisions and managing portfolios. Recent studies have demonstrated the potential of temporal-relational models (TRM) to yield excess investment returns. However, in the complicated finance ecosystem, current TRMs suffer from both the intrinsic temporal bias arising from the low signal-to-noise ratio (SNR) and the relational bias caused by utilizing inappropriate relational topologies and propagation mechanisms. Moreover, the distribution shifts behind macro-market scenarios invalidate the underlying i.i.d. assumption and limit the generalization ability of TRMs. In this paper, we pioneer the study of how the above issues impair the effective learning of temporal-relational patterns and propose an Automatic De-Biased Temporal-Relational Model (ADB-TRM) for stock recommendation. Specifically, ADB-TRM consists of three main components: (i) a meta-learned architecture that forms a dual-stage training process, with the inner part ameliorating temporal-relational bias and the outer meta-learner counteracting distribution shifts; (ii) automatic adversarial sample generation, which adaptively guides the model to alleviate bias and enhance its profiling ability through adversarial training; and (iii) global-local interaction, which seeks relatively invariant stock embeddings from local and global distribution perspectives to mitigate distribution shifts. Experiments on three datasets from distinct stock markets show that ADB-TRM outperforms state-of-the-art methods by 28.41% and 9.53% in terms of cumulative and risk-adjusted returns, respectively.
List of keywords
Data Mining -> DM: Applications
Data Mining -> DM: Mining spatial and/or temporal data
Machine Learning -> ML: Time series and data streams
Multidisciplinary Topics and Applications -> MTA: Finance
103
TaD: A Plug-and-Play Task-Aware Decoding Method to Better Adapt LLMs on Downstream Tasks
Xinhao Xu, Hui Chen, Zijia Lin, Jungong Han, Lixing Gong, Guoxin Wang, Yongjun Bao, Guiguang Ding
6 min. talk | August 7th at 15:00 | Session: NLP: Natural Language Processing (2/3)
Fine-tuning pre-trained models on downstream tasks is a common practice in leveraging large language models (LLMs) today. A critical issue is how to better adapt pre-trained models to downstream tasks, thereby enhancing their performance. This paper introduces Task-aware Decoding (TaD), a plug-and-play method that exploits the difference in probability distributions before and after fine-tuning to boost the performance of LLMs on downstream tasks. TaD posits that the difference between the pre-finetuning probability distribution and the post-finetuning one represents the direction from common knowledge towards task-specific downstream knowledge. Aligning the final output probability distribution with that direction can likely yield superior downstream task performance compared to the original fine-tuned model. Experiments on various datasets across four task categories demonstrate TaD’s effectiveness on different LLMs, i.e., GPT, BLOOM, and LLaMA, with different fine-tuning methods. Moreover, further experiments reveal that TaD delivers larger gains in data-scarce scenarios.
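A minimal sketch of the general idea, assuming a contrastive-decoding-style extrapolation; the scalar `beta` and the exact combination rule are our assumptions, not the paper's formula.

```python
import torch

def task_aware_logits(logits_ft, logits_pre, beta=0.5):
    # Work in log-probability space so the shift is distribution-aware.
    logp_ft = torch.log_softmax(logits_ft, dim=-1)
    logp_pre = torch.log_softmax(logits_pre, dim=-1)
    # Extrapolate along the "common -> task-specific" knowledge direction;
    # renormalize (softmax) before sampling from the result.
    return logp_ft + beta * (logp_ft - logp_pre)
```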
List of keywords
Natural Language Processing -> NLP: Language generation
Natural Language Processing -> NLP: Applications
Natural Language Processing -> NLP: Language models
128
Physics-Informed Trajectory Prediction for Autonomous Driving under Missing Observation
Haicheng Liao, Chengyue Wang, Zhenning Li, Yongkang Li, Bonan Wang, Guofa Li, Chengzhong Xu
6 min. talk | August 6th at 11:30 | Session: ROB: Robotics (1/2)
This paper introduces a novel trajectory prediction approach for autonomous vehicles (AVs) that adeptly addresses the challenges of missing observations and the need for adherence to physical laws in real-world driving environments. The approach is a hierarchical two-stage prediction model. In the first stage, we propose the Wavelet Reconstruction Network, an innovative tool crafted for reconstructing missing observations, which can optionally be integrated with state-of-the-art models to enhance their robustness. The second stage features the Wave Fusion Encoder, a quantum mechanics-inspired innovation for sophisticated vehicle interaction modeling. By incorporating the Kinematic Bicycle Model, we ensure that our predictions align with realistic vehicular kinematics. Complementing our methodological advancements, we introduce MoCAD-missing, a comprehensive real-world traffic dataset, alongside enhanced versions of the NGSIM and HighD datasets, designed to facilitate rigorous testing in environments with missing observations. Extensive evaluations demonstrate that our approach markedly outperforms existing methods, achieving high accuracy even in scenarios with up to 75% missing observations.
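The Kinematic Bicycle Model the abstract refers to is a standard physical prior; a minimal Euler-integration sketch follows (the wheelbase parameters and integration scheme are illustrative choices, not the paper's).

```python
import math

def kinematic_bicycle_step(x, y, psi, v, a, delta, dt, lf=1.2, lr=1.6):
    """One Euler step of the standard kinematic bicycle model.
    (x, y): position, psi: heading, v: speed,
    a: acceleration command, delta: front steering angle."""
    beta = math.atan(lr / (lf + lr) * math.tan(delta))  # slip angle at the CoG
    x += v * math.cos(psi + beta) * dt
    y += v * math.sin(psi + beta) * dt
    psi += v / lr * math.sin(beta) * dt
    v += a * dt
    return x, y, psi, v
```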
List of keywords
Robotics -> ROB: Other
Planning and Scheduling -> PS: Planning under uncertainty
129
MFTraj: Map-Free, Behavior-Driven Trajectory Prediction for Autonomous Driving
Haicheng Liao, Zhenning Li, Chengyue Wang, Huanming Shen, Dongping Liao, Bonan Wang, Guofa Li, Chengzhong Xu
6 min. talk | August 6th at 15:00 | Session: MTA: Transportation
This paper introduces a trajectory prediction model tailored for autonomous driving, focusing on capturing complex interactions in dynamic traffic scenarios without reliance on high-definition maps. The model, termed MFTraj, harnesses historical trajectory data combined with a novel dynamic geometric graph-based behavior-aware module. At its core, an adaptive structure-aware interactive graph convolutional network captures both positional and behavioral features of road users, preserving spatial-temporal intricacies. Enhanced by a linear attention mechanism, the model achieves computational efficiency and reduced parameter overhead. Evaluations on the Argoverse, NGSIM, HighD, and MoCAD datasets underscore MFTraj’s robustness and adaptability, outperforming numerous benchmarks even in data-challenged scenarios without the need for additional information such as HD maps or vectorized maps. Importantly, it maintains competitive performance even in scenarios with substantial missing data (12.5%-50%), outperforming most existing state-of-the-art models. The results and methodology suggest a significant advancement in autonomous driving trajectory prediction, paving the way for safer and more efficient autonomous systems.
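The linear attention mechanism invoked for efficiency is a generic technique; a common kernelized formulation (with an ELU+1 feature map, our assumption for illustration) reduces the cost from O(n²d) to O(nd²):

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention sketch.
    q, k: (batch, n, d) queries/keys; v: (batch, n, e) values."""
    q = F.elu(q) + 1  # positive feature map phi(.)
    k = F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)              # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)     # normalized output
```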
List of keywords
Multidisciplinary Topics and Applications -> MTA: Transportation
Agent-based and Multi-agent Systems -> MAS: Applications
Agent-based and Multi-agent Systems -> MAS: Multi-agent planning
Robotics -> ROB: Other
130
A Cognitive-Driven Trajectory Prediction Model for Autonomous Driving in Mixed Autonomy Environments
Haicheng Liao, Zhenning Li, Chengyue Wang, Bonan Wang, Hanlin Kong, Yanchen Guan, Guofa Li, Zhiyong Cui
6 min. talk | August 6th at 15:00 | Session: MTA: Transportation
As autonomous driving technology progresses, the need for precise trajectory prediction models becomes paramount. This paper introduces an innovative model that infuses cognitive insights into trajectory prediction, focusing on perceived safety and dynamic decision-making. Distinct from traditional approaches, our model excels in analyzing interactions and behavior patterns in mixed autonomy traffic scenarios. As part of our contributions, we introduce the Macao Connected Autonomous Driving (MoCAD) dataset, valuable for its complex urban driving scenarios. Our model represents a significant leap forward, achieving marked performance improvements on several key datasets. Specifically, it surpasses existing benchmarks with gains of 16.2% on the Next Generation Simulation (NGSIM), 27.4% on the Highway Drone (HighD), and 19.8% on the MoCAD dataset. Our proposed model shows exceptional proficiency in handling corner cases, essential for real-world applications. Moreover, its robustness is evident in scenarios with missing or limited data, outperforming most of the state-of-the-art baselines. This adaptability and resilience position our model as a viable tool for real-world autonomous driving systems, heralding a new standard in vehicle trajectory prediction for enhanced safety and efficiency.
List of keywords
Multidisciplinary Topics and Applications -> MTA: Transportation
Agent-based and Multi-agent Systems -> MAS: Human-agent interaction
Planning and Scheduling -> PS: Applications
Robotics -> ROB: Motion and path planning
139
Hyperparameter Optimization Can Even Be Harmful in Off-Policy Learning and How to Deal with It
Yuta Saito, Masahiro Nomura
12 min. talk | August 7th at 15:00 | Session: ML: Machine Learning (5/6)
There has been growing interest in off-policy evaluation in application areas such as recommender systems and personalized medicine. We have so far seen significant progress in developing estimators aimed at accurately estimating the effectiveness of counterfactual policies from biased logged data. However, in many cases those estimators are used not only to evaluate the value of decision-making policies but also to search for the best hyperparameters from a large candidate space. This work explores the latter hyperparameter optimization (HPO) task for off-policy learning. We empirically show that naively using an unbiased estimator of the generalization performance as a surrogate objective in HPO can cause an unexpected failure: it merely pursues hyperparameters whose generalization performance is greatly overestimated. We then propose simple and computationally efficient corrections to the typical HPO procedure to deal with these issues simultaneously. Empirical investigations demonstrate the effectiveness of our proposed HPO algorithm in situations where the typical procedure fails severely.
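To see the failure mode concretely: a per-candidate unbiased value estimate, such as inverse propensity scoring (IPS), becomes optimistically biased once a maximum is taken over many hyperparameter candidates (the winner's curse). A toy IPS estimator, our illustration rather than the authors' corrected procedure:

```python
import numpy as np

def ips_value(rewards, logging_probs, target_probs):
    """Unbiased IPS estimate of a target policy's value from logged data.
    Unbiased per candidate, but argmax over many noisy estimates favors
    candidates whose value happens to be overestimated."""
    weights = target_probs / logging_probs   # importance weights
    return np.mean(weights * rewards)
```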
List of keywords
Machine Learning -> ML: Causality
Machine Learning -> ML: Hyperparameter optimization
Machine Learning -> ML: Multi-armed bandits
Uncertainty in AI -> UAI: Causality, structural causal models and causal inference
158
AutoAgents: A Framework for Automatic Agent Generation
Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje Karlsson, Jie Fu, Yemin Shi
6 min. talk | August 9th at 10:00 | Session: MAS: Applications
Large language models (LLMs) have enabled remarkable advances in automated task-solving with multi-agent systems. However, most existing LLM-based multi-agent approaches rely on predefined agents to handle simple tasks, limiting the adaptability of multi-agent collaboration to different scenarios. Therefore, we introduce AutoAgents, an innovative framework that adaptively generates and coordinates multiple specialized agents to build an AI team according to different tasks. Specifically, AutoAgents couples the relationship between tasks and roles by dynamically generating multiple required agents based on task content and planning solutions for the current task based on the generated expert agents. Multiple specialized agents collaborate with each other to efficiently accomplish tasks. Concurrently, an observer role is incorporated into the framework to reflect on the designated plans and agents’ responses and improve upon them. Our experiments on various benchmarks demonstrate that AutoAgents generates more coherent and accurate solutions than the existing multi-agent methods. This underscores the significance of assigning different roles to different tasks and of team cooperation, offering new perspectives for tackling complex tasks. The repository of this project is available at https://github.com/Link-AGI/AutoAgents.
List of keywords
Agent-based and Multi-agent Systems -> MAS: Applications
Natural Language Processing -> NLP: Applications
161
Boosting Model Resilience via Implicit Adversarial Data Augmentation
Xiaoling Zhou, Wei Ye, Zhemg Lee, Rui Xie, Shikun Zhang
12 min. talk | August 6th at 15:00 | Session: ML: Machine Learning (1/6)
Data augmentation plays a pivotal role in enhancing and diversifying training data. Nonetheless, consistently improving model performance in varied learning scenarios, especially those with inherent data biases, remains challenging. To address this, we propose to augment the deep features of samples by incorporating their adversarial and anti-adversarial perturbation distributions, enabling adaptive adjustment in the learning difficulty tailored to each sample’s specific characteristics. We then theoretically reveal that our augmentation process approximates the optimization of a surrogate loss function as the number of augmented copies increases indefinitely. This insight leads us to develop a meta-learning-based framework for optimizing classifiers with this novel loss, introducing the effects of augmentation while bypassing the explicit augmentation process. We conduct extensive experiments across four common biased learning scenarios: long-tail learning, generalized long-tail learning, noisy label learning, and subpopulation shift learning. The empirical results demonstrate that our method consistently achieves state-of-the-art performance, highlighting its broad adaptability.
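A minimal sketch of the explicit version of such feature-level perturbations, which the paper then bypasses via its surrogate loss; the gradient-step form and `epsilon` are assumptions.

```python
import torch

def adversarial_feature_pair(features, loss, epsilon=0.1):
    # `features` must be a graph leaf with requires_grad=True.
    grad = torch.autograd.grad(loss, features, retain_graph=True)[0]
    direction = grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    adv = features + epsilon * direction    # harder (adversarial) version
    anti = features - epsilon * direction   # easier (anti-adversarial) version
    return adv, anti
```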
List of keywords
Machine Learning -> ML: Classification
Data Mining -> DM: Class imbalance and unequal cost
Machine Learning -> ML: Meta-learning
Machine Learning -> ML: Robustness
168
Negative-Binomial Randomized Gamma Dynamical Systems for Heterogeneous Overdispersed Count Time Sequences
Rui Huang, Sikun Yang, Heinz Koeppl
6 min. talk | August 8th at 15:00 | Session: ML: Time series and data streams
Modeling count-valued time sequences has received growing interest because count time sequences naturally arise in physical and social domains. Poisson gamma dynamical systems (PGDSs) are recently developed methods that can well capture the expressive latent transition structure and bursty dynamics behind count sequences. In particular, PGDSs demonstrate superior performance in data imputation and prediction compared with canonical linear dynamical system (LDS)-based methods. Despite these advantages, PGDSs cannot capture the heterogeneous overdispersed behaviours of the underlying dynamic processes. To mitigate this defect, we propose a negative-binomial-randomized gamma Markov process, which not only significantly improves the predictive performance of the proposed dynamical system, but also facilitates fast convergence of the inference algorithm. Moreover, we develop methods to estimate both factor-structured and graph-structured transition dynamics, which enable us to infer more explainable latent structure compared with PGDSs. Finally, we demonstrate the explainable latent structure learned by the proposed method and show its superior performance in imputing missing data and forecasting future observations compared with related models.
List of keywords
Machine Learning -> ML: Time series and data streams
Machine Learning -> ML: Bayesian learning
Machine Learning -> ML: Probabilistic machine learning
Uncertainty in AI -> UAI: Tractable probabilistic models
169
Scale and Direction Guided GAN for Inertial Sensor Signal Enhancement
Yifeng Wang, Yi Zhao
6 min. talk | August 9th at 11:30 | Session: ML: Generative models
Inertial sensors, serving as attitude and motion sensing components, are extensively used in various portable devices spanning consumer electronics, sports and health, aerospace, etc. However, the severe intrinsic errors of inertial sensors greatly restrict their capability to support advanced functions such as motion tracking and semantic recognition. Although generative models hold significant potential for signal enhancement, unsupervised or weakly-supervised generative methods may not achieve ideal generation results due to the absence of guidance from paired data. To address this, we propose a scale and direction-guided generative adversarial network (SDG-GAN), which provides dual guidance mechanisms for GANs with unpaired data across two practical application scenarios. In the unsupervised scenario, where only unpaired signals of varying quality are available, our scale-guided GAN (SG-GAN) forces the generator to learn high-quality signal characteristics at different scales simultaneously via the proposed self-supervised zoom constraint, thereby facilitating multi-scale interactive learning. In the weakly-supervised scenario, where additional experimental equipment can provide some motion information, our direction-guided GAN (DG-GAN) introduces auxiliary tasks to supervise signal generation while avoiding interference from the auxiliary tasks on the main generation task. Extensive experiments demonstrate that both the unsupervised SG-GAN and the weakly-supervised DG-GAN significantly outperform all comparison methods, including fully-supervised approaches. The combined SDG-GAN achieves remarkable results, enabling tasks previously infeasible with the raw inertial signal, such as 3D motion tracking.
List of keywords
Machine Learning -> ML: Generative models
Machine Learning -> ML: Unsupervised learning
Machine Learning -> ML: Weakly supervised learning
Multidisciplinary Topics and Applications -> MTA: Sensor networks and smart cities
170
Nukplex: An Efficient Local Search Algorithm for Maximum K-Plex Problem
Rui Sun, Yiyuan Wang, Shimao Wang, Hui Li, Ximing Li, Minghao Yin
6 min. talk | August 8th at 10:00 | Session: S: Search
The maximum k-plex problem (MKPP) is a significant relaxation of the maximum clique problem with extensive applications. Recently, many heuristic algorithms based on various methods have been proposed to solve the MKPP. In this work, to further improve performance on the MKPP, we propose an efficient local search algorithm based on three main ideas. First, we propose a relaxed bounded configuration checking strategy that considers two kinds of historical search information to relax the restriction strength of configuration checking and the forbidden condition of candidate vertices for the Add operation, respectively. Second, we present a novel solution-information-based vertex selection strategy that uses two kinds of solution information to select high-quality candidate vertices. Third, we define the solution core and introduce a core-based perturbation strategy to help the algorithm escape local optima. The experimental results show that the proposed algorithm significantly outperforms state-of-the-art MKPP algorithms on almost all instances.
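For reference, the defining property of a k-plex (every vertex may miss at most k-1 neighbors within the set) can be checked in a few lines; representing the graph as a dict of neighbor sets is our choice for illustration.

```python
def is_k_plex(adj, S, k):
    """True if S is a k-plex in the graph given by adj (vertex -> set of
    neighbors): every vertex of S has at least |S| - k neighbors in S."""
    S = set(S)
    return all(len(S & adj[v]) >= len(S) - k for v in S)
```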
List of keywords
Search -> S: Local search
Search -> S: Heuristic search
185
Design a Win-Win Strategy That Is Fair to Both Service Providers and Tasks When Rejection Is Not an Option
Yohai Trabelsi, Pan Xu, Sarit Kraus
12 min. talk | August 7th at 11:30 | Session: MAS: Agent-based and Multi-agent Systems (1/2)
Assigning tasks to service providers is a frequent procedure across various applications. Often the tasks arrive dynamically while the service providers remain static. Preventing task rejection caused by service provider overload is of utmost significance. To ensure a positive experience in relevant applications for both service providers and tasks, fairness must be considered. To address these issues, we model the problem as online matching within a bipartite graph and tackle two minimax problems: one focuses on minimizing the highest waiting time of a task, while the other aims to minimize the highest workload of a service provider. We show that the second problem can be expressed as a linear program and thus solved efficiently while maintaining a reasonable approximation to the objective of the first problem. We develop novel methods that utilize the two minimax problems, and extensive simulation experiments using real data demonstrate that our heuristics, based on the linear program, perform remarkably well.
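One plausible way to write the min-max workload problem as a linear program (notation ours, not necessarily the paper's formulation): with x_{ij} the fraction of task j assigned to provider i and c_j the load of task j,

```latex
\begin{aligned}
\min_{x,\,z}\quad & z \\
\text{s.t.}\quad & \textstyle\sum_{j} c_j\, x_{ij} \le z && \text{for every provider } i,\\
& \textstyle\sum_{i} x_{ij} = 1 && \text{for every task } j,\\
& x_{ij} \ge 0 .
\end{aligned}
```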
List of keywords
Agent-based and Multi-agent Systems -> MAS: Resource allocation
190
Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion
Bohan Li, Yasheng Sun, Zhujin Liang, Dalong Du, Zhuanghui Zhang, Xiaofeng Wang, Yunnan Wang, Xin Jin, Wenjun Zeng
6 min. talk | August 6th at 15:00 | Session: CV: 3D computer vision (2/2)
3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations. Previous camera-based methods struggle to predict accurate semantic scenes due to inherent geometric ambiguity and incomplete observations. In this paper, we resort to stereo matching and bird’s-eye-view (BEV) representation learning to address these issues in SSC. Complementary to each other, stereo matching mitigates geometric ambiguity with the epipolar constraint, while BEV representation enhances the hallucination ability for invisible regions with global semantic context. However, due to the inherent representation gap between stereo geometry and BEV features, it is non-trivial to bridge them for the dense prediction task of SSC. Therefore, we further develop a unified occupancy-based framework dubbed BRGScene, which effectively bridges these two representations with dense 3D volumes for reliable semantic scene completion. Specifically, we design a novel Mutual Interactive Ensemble (MIE) block for pixel-level reliable aggregation of stereo geometry and BEV features. Within the MIE block, a Bi-directional Reliable Interaction (BRI) module, enhanced with confidence re-weighting, is employed to encourage fine-grained interaction through mutual guidance. Besides, a Dual Volume Ensemble (DVE) module is introduced to facilitate complementary aggregation through channel-wise recalibration and multi-group voting. Our method outperforms all published camera-based methods on SemanticKITTI for semantic scene completion. Our code is available at https://github.com/Arlo0o/StereoScene.
List of keywords
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Scene analysis and understanding   
195
Unbiased Active Semi-supervised Binary Classification Models
JooChul Lee, Weidong Ma, Ziyang Wang
6 min. talk | August 7th at 15:00 | Session: ML: Machine Learning (3/6)
Active learning is a well-motivated approach that aims to maximize model performance with relatively little data, but it introduces sampling bias due to active selection. To adjust for this bias, the current literature utilizes corrective weights in a supervised learning approach. However, those methods consider only the small amount of actively sampled data, so estimation efficiency can be improved by additionally exploiting unsampled data. In this paper, we develop an actively improved augmented estimation equation (AI-AEE) based on corrective weights as well as imputation models that allow us to leverage unlabeled data. We derive the asymptotic distribution of the proposed estimator as the solution to the AI-AEE and propose an optimal sampling scheme that minimizes the asymptotic mean squared error of the estimator. We then propose a general practical algorithm for training prediction models in the active and semi-supervised learning framework. The superiority of our method is demonstrated on synthetic and real data examples.
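Such augmented estimating equations typically build on the generic augmented inverse-probability-weighting form (notation ours; the paper's AI-AEE adds active-sampling-specific corrections): with R_i the sampling indicator, π_i the active-sampling probability, ψ the score function, and ψ̂ its imputation-model counterpart,

```latex
\sum_{i=1}^{n}\left[\frac{R_i}{\pi_i}\,\psi(x_i, y_i;\theta)
 + \left(1-\frac{R_i}{\pi_i}\right)\hat{\psi}(x_i;\theta)\right] = 0 .
```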
List of keywords
Machine Learning -> ML: Active learning
Machine Learning -> ML: Regression
Machine Learning -> ML: Semi-supervised learning
200
Proportion-based Sensitivity Analysis of Uncontrolled Confounding Bias in Causal Inference
Haruka Yoshida, Manabu Kuroki
6 min. talk | August 7th at 10:00 | Session: UAI: Causality, structural causal models and causal inference
Uncontrolled confounding bias causes a spurious relationship between an exposure variable and an outcome variable and precludes reliable evaluation of the causal effect from observed data. Thus, it is important to observe a sufficient set of confounders to reliably evaluate the causal effect. However, there is no statistical method for judging whether an available set of covariates is sufficient to derive a reliable estimator for the causal effect. To address this problem, we focus on the fact that the mean squared error (MSE) of the outcome variable with respect to the average causal risk can be described as the sum of "the conditional variance of the outcome variable given the exposure variable" and "the square of the uncontrolled confounding bias". We then propose a novel sensitivity analysis, namely, the proportion-based sensitivity analysis of uncontrolled confounding bias in causal effects (PSA), in which the sensitivity parameter is formulated as the proportion of "the square of the uncontrolled confounding bias" to the MSE, and we clarify some of its properties. We also demonstrate the applicability of the PSA through two case studies.
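In symbols, the decomposition the abstract describes and the resulting sensitivity parameter read (our notation, with b the uncontrolled confounding bias):

```latex
\mathrm{MSE} \;=\; \underbrace{\mathrm{Var}(Y \mid X)}_{\text{conditional variance}}
 \;+\; \underbrace{b^{2}}_{\text{squared confounding bias}},
\qquad
\phi \;=\; \frac{b^{2}}{\mathrm{MSE}} \in [0, 1].
```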
List of keywords
Uncertainty in AI -> UAI: Causality, structural causal models and causal inference
204
Contrastive General Graph Matching with Adaptive Augmentation Sampling
Jianyuan Bo, Yuan Fang
12 min. talk | August 8th at 11:30 | Session: ML: Unsupervised learning
Graph matching has important applications in pattern recognition and beyond. Current approaches predominantly adopt supervised learning, demanding extensive labeled data which can be limited or costly. Meanwhile, self-supervised learning methods for graph matching often require additional side information such as extra categorical information and input features, limiting their application to the general case. Moreover, designing the optimal graph augmentations for self-supervised graph matching presents another challenge to ensure robustness and efficacy. To address these issues, we introduce a novel Graph-centric Contrastive framework for Graph Matching (GCGM), capitalizing on a vast pool of graph augmentations for contrastive learning, yet without needing any side information. Given the variety of augmentation choices, we further introduce a Boosting-inspired Adaptive Augmentation Sampler (BiAS), which adaptively selects more challenging augmentations tailored for graph matching. Through various experiments, our GCGM surpasses state-of-the-art self-supervised methods across various datasets, marking a significant step toward more effective, efficient and general graph matching.
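A minimal sketch of a boosting-inspired sampler of this flavor: pick augmentations with probability increasing in their current difficulty. The exponential weighting and `temperature` are assumptions, not the paper's BiAS rule.

```python
import math
import random

def sample_augmentation(augmentations, difficulty, temperature=1.0):
    # Weight each augmentation by exp(difficulty / T), so harder views
    # (e.g., those inducing higher contrastive loss) are chosen more often.
    weights = [math.exp(d / temperature) for d in difficulty]
    return random.choices(augmentations, weights=weights, k=1)[0]
```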
List of keywords
Machine Learning -> ML: Unsupervised learning
Machine Learning -> ML: Self-supervised Learning
Machine Learning -> ML: Sequence and graph learning
214
FedSSA: Semantic Similarity-based Aggregation for Efficient Model-Heterogeneous Personalized Federated Learning
Liping Yi, Han Yu, Zhuan Shi, Gang Wang, Xiaoguang Liu, Lizhen Cui, Xiaoxiao Li
6 min. talk | August 6th at 15:00 | Session: ML: Federated learning (1/2)
Federated learning (FL) is a privacy-preserving collaborative machine learning paradigm. Traditional FL requires all data owners (a.k.a. FL clients) to train the same local model. This design is not well-suited for scenarios involving data and/or system heterogeneity. Model-Heterogeneous Personalized FL (MHPFL) has emerged to address this challenge. Existing MHPFL approaches often rely on a public dataset with the same nature as the learning task, or incur high computation and communication costs. To address these limitations, we propose the Federated Semantic Similarity Aggregation (FedSSA) approach for supervised classification tasks, which splits each client’s model into a heterogeneous (structure-different) feature extractor and a homogeneous (structure-same) classification header. It performs local-to-global knowledge transfer via semantic similarity-based header parameter aggregation. In addition, global-to-local knowledge transfer is achieved via an adaptive parameter stabilization strategy that fuses the seen-class parameters of historical local headers with those of the latest global header for each client. FedSSA does not rely on public datasets and only requires partial header parameter transmission, saving costs. Theoretical analysis proves the convergence of FedSSA. Extensive experiments show that FedSSA achieves up to 3.62% higher accuracy, 15.54 times higher communication efficiency, and 15.52 times higher computational efficiency compared to 7 state-of-the-art MHPFL baselines.
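A minimal sketch of the global-to-local fusion idea, assuming classification headers stored as (num_classes x dim) weight matrices; the blend coefficient `mu` and the row-wise rule are illustrative, while the paper's strategy adapts this fusion over training rounds.

```python
import torch

def stabilized_header(local_hdr, global_hdr, seen_classes, mu=0.5):
    # Blend the client's historical header with the latest global header,
    # but only on the rows of classes this client has actually seen.
    fused = global_hdr.clone()
    fused[seen_classes] = (mu * local_hdr[seen_classes]
                           + (1 - mu) * global_hdr[seen_classes])
    return fused
```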
List of keywords
Machine Learning -> ML: Federated learning
219
Empirical Analysis of Dialogue Relation Extraction with Large Language Models
Guozheng Li, Zijie Xu, Ziyu Shang, Jiajun Liu, Ke Ji, Yikai Guo
6 min. talk | August 7th at 15:00 | Session: NLP: Natural Language Processing (2/3)
Dialogue relation extraction (DRE) aims to extract relations between two arguments within a dialogue, which is more challenging than standard RE due to the higher frequency of person pronouns and lower information density in dialogues. However, existing DRE methods still suffer from two serious issues: (1) they are hard-pressed to capture long and sparse multi-turn information, and (2) they struggle to extract gold relations based on partial dialogues. This motivates us to discover more effective methods that can alleviate these issues. We notice that the rise of large language models (LLMs) has sparked considerable interest in evaluating their performance across diverse tasks. To this end, we investigate the capabilities of different LLMs in DRE, considering both proprietary and open-source models. Interestingly, we discover that LLMs significantly alleviate both issues in existing DRE methods. Generally, we have the following findings: (1) scaling up model size substantially boosts overall DRE performance and achieves exceptional results, tackling the difficulty of capturing long and sparse multi-turn information; (2) LLMs suffer a much smaller performance drop than existing methods when moving from the entire-dialogue setting to the partial-dialogue setting; (3) LLMs deliver competitive or superior performance under both full-shot and few-shot settings compared to the current state of the art; (4) LLMs show modest performance on inverse relations but much stronger improvements on general relations, and they can handle dialogues of various lengths, especially longer sequences.
List of keywords
Natural Language Processing -> NLP: Information extraction
223
Graph Contrastive Learning with Reinforcement Augmentation
Ziyang Liu, Chaokun Wang, Cheng Wu
6 min. talk | August 6th at 11:30 | Session: DM: Mining graphs (1/3)
Graph contrastive learning (GCL), designing contrastive objectives to learn embeddings from augmented graphs, has become a prevailing method for extracting embeddings from graphs in an unsupervised manner. As an important procedure in GCL, graph data augmentation (GDA) directly affects the model performance on downstream tasks. Currently, the GCL methods typically treat GDA as independent events, neglecting its continuity. In this paper, we regard the GDA in GCL as a Markov decision process and propose a novel graph reinforcement augmentation framework for GCL. Based on this framework, we design a Graph Advantage Actor-Critic (GA2C) model. We conduct extensive experiments to evaluate GA2C on unsupervised learning, transfer learning, and semi-supervised learning. The experimental results demonstrate the performance superiority of GA2C over the state-of-the-art GCL models. Furthermore, we verify that GA2C is more efficient than the other GCL methods with learnable GDA and provide two examples of chemical molecular graphs from ZINC-2M to demonstrate that GA2C generates meaningful augmented views, where the edge weights reflect the importance of chemical bonds in the molecule.
List of keywords
Data Mining -> DM: Mining graphs
Machine Learning -> ML: Representation learning
Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Self-supervised Learning
230
Negative Prompt Driven Complementary Parallel Representation for Open-World 3D Object Retrieval
Yang Xu, Yifan Feng, Yue Gao
6 min. talk | August 6th at 15:00 | Session: CV: 3D computer vision (2/2)
The limited availability of supervised labels (positive information) poses a notable challenge for open-world retrieval. However, negative information is more easily obtained but remains underexploited in current methods. In this paper, we introduce the Negative Prompt Driven Complementary Parallel Representation (NPCP) framework, which navigates the complexities of open-world retrieval through the lens of Negative Prompts. Specifically, we employ the Parallel Exclusive Embedding (PEE) to effectively utilize the prompt information, bilaterally capturing both explicit negative and implicit positive signals. To address the challenges of embedding unification and generalization, our method leverages high-order correlations among objects through the Complementary Structure Tuning (CST), by constructing a complementary hypergraph based on bi-directional and cross-category correlations. We have developed four multimodal datasets for open-world 3D object retrieval with negative prompts: NPMN, NPAB, NPNT, and NPES. Extensive experiments and ablation studies on these four benchmarks demonstrate the superiority of our method over current state-of-the-art approaches.
List of keywords
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Image and video retrieval 
Computer Vision -> CV: Representation learning
231
MICRO: Model-Based Offline Reinforcement Learning with a Conservative Bellman Operator
Xiao-Yin Liu, Xiao-Hu Zhou, Guotao Li, Hao Li, Mei-Jiang Gui, Tian-Yu Xiang, De-Xing Huang, Zeng-Guang Hou
6 min. talk | August 7th at 15:00 | Session: ML: Reinforcement learning (1/2)
Offline reinforcement learning (RL) faces a significant challenge of distribution shift. Model-free offline RL penalizes the Q value for out-of-distribution (OOD) data or constrains the policy to stay close to the behavior policy to tackle this problem, but this inhibits exploration of the OOD region. Model-based offline RL, which uses the trained environment model to generate more OOD data and performs conservative policy optimization within that model, has become an effective method for this problem. However, current model-based algorithms rarely consider agent robustness when incorporating conservatism into the policy. Therefore, we propose MICRO, a new model-based offline algorithm with a conservative Bellman operator. This method trades off performance and robustness by introducing the robust Bellman operator into the algorithm. Compared with previous model-based algorithms with robust adversarial models, MICRO significantly reduces computation cost by only choosing the minimal Q value in the state uncertainty set. Extensive experiments demonstrate that MICRO outperforms prior RL algorithms on offline RL benchmarks and is considerably robust to adversarial perturbations.
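A sketch of a conservative Bellman target in the spirit described: take the minimal Q value over a sampled state uncertainty set instead of training an adversarial environment model. The set construction, network signatures, and shapes here are illustrative assumptions.

```python
import torch

def conservative_target(q_net, policy, next_state_set, rewards, gamma=0.99):
    """next_state_set: iterable of (batch, state_dim) perturbed next states
    drawn from the state uncertainty set; q_net(s, a) -> (batch,) values."""
    with torch.no_grad():
        qs = torch.stack([q_net(s, policy(s)) for s in next_state_set])  # (n, batch)
        return rewards + gamma * qs.min(dim=0).values  # worst case over the set
```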
List of keywords
Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Model-based and model learning reinforcement learning
Machine Learning -> ML: Offline reinforcement learning
Machine Learning -> ML: Robustness
234
Spear: Evaluate the Adversarial Robustness of Compressed Neural Models
Chong Yu, Tao Chen, Zhongxue Gan, Jiayuan Fan
12 min. talk | August 7th at 11:30 | Session: CV: Adversarial learning, adversarial attack and defense methods
As Artificial Intelligence evolves, neural models vulnerable to adversarial attacks may produce fatal results in critical applications. This paper discusses the robustness of compressed neural models facing adversarial attacks. A few studies discuss the interaction between model compression and adversarial attacks. However, they focus on robustness against traditional attacks designed for dense models, not attacks intended explicitly for compressed models that use sparsity and quantization techniques. Compressed models often have fewer parameters and smaller sizes, making them more friendly to resource-limited devices than dense models, so they are widely deployed in various edge and mobile devices. However, introducing sparsity and quantization into neural models imposes higher attack risks. We propose a specific adversarial attack method (Spear) to generate particular adversarial samples for evaluating the robustness of compressed models. The Spear attack finds minimal perturbations to create attack samples that maximize the behavioral difference between the compressed and dense reference models. We demonstrate through quantitative and ablation experiments that the proposed Spear attack technique can generally be applied to various networks and tasks.
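A sketch of an objective in the spirit described: minimal perturbation, maximal behavioral gap between the compressed model and its dense reference. The KL-based gap and the weight `lam` are our assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def spear_style_objective(dense, compressed, x_adv, x, lam=10.0):
    # Behavioral gap between the two models on the perturbed input.
    log_p_dense = torch.log_softmax(dense(x_adv), dim=-1)
    p_comp = torch.softmax(compressed(x_adv), dim=-1)
    gap = F.kl_div(log_p_dense, p_comp, reduction="batchmean")
    # Minimizing this maximizes the gap while keeping x_adv close to x.
    return -gap + lam * (x_adv - x).pow(2).mean()
```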
List of keywords
Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
Machine Learning -> ML: Adversarial machine learning
Machine Learning -> ML: Robustness
247
RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM
Ziying Song, Guoxing Zhang, Lin Liu, Lei Yang, Shaoqing Xu, Caiyan Jia, Feiyang Jia, Li Wang
6 min. talk | August 6th at 11:30 | Session: CV: 3D computer vision (1/2)
Multi-modal 3D object detectors are dedicated to exploring secure and reliable perception systems for autonomous driving (AD). Although achieving state-of-the-art (SOTA) performance on clean benchmark datasets, they tend to overlook the complexity and harsh conditions of real-world environments. With the emergence of visual foundation models (VFMs), opportunities and challenges are presented for improving the robustness and generalization of multi-modal 3D object detection in AD. Therefore, we propose RoboFusion, a robust framework that leverages VFMs like SAM to tackle out-of-distribution (OOD) noise scenarios. We first adapt the original SAM to AD scenarios, yielding SAM-AD. To align SAM or SAM-AD with multi-modal methods, we then introduce AD-FPN for upsampling the image features extracted by SAM. We employ wavelet decomposition to denoise the depth-guided images, further reducing noise and weather interference. Finally, we employ self-attention mechanisms to adaptively reweight the fused features, enhancing informative features while suppressing excess noise. In summary, RoboFusion significantly reduces noise by leveraging the generalization and robustness of VFMs, thereby enhancing the resilience of multi-modal 3D object detection. Consequently, RoboFusion achieves SOTA performance in noisy scenarios, as demonstrated by the KITTI-C and nuScenes-C benchmarks. Code is available at https://github.com/adept-thu/RoboFusion.
List of keywords
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Recognition (object detection, categorization)
Robotics -> ROB: Perception
260
A Semi-supervised Molecular Learning Framework for Activity Cliff Estimation
Fang Wu
6 min. talk | August 8th at 15:00 | Session: MTA: Bioinformatics
Machine learning (ML) enables accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Its success rests on the principle of similarity, which assumes that similar molecules exhibit close properties. However, activity cliffs challenge this principle, and their presence leads to a sharp decline in the performance of existing ML algorithms, particularly graph-based methods. To overcome this obstacle under a low-data scenario, we propose a novel semi-supervised learning (SSL) method dubbed SemiMol, which employs predictions on numerous unannotated data as pseudo-signals for subsequent training. Specifically, we introduce an additional instructor model to evaluate the accuracy and trustworthiness of proxy labels, because existing pseudo-labeling approaches require probabilistic outputs to reveal the model’s confidence and fail to apply to regression tasks. Moreover, we design a self-adaptive curriculum learning algorithm to progressively move the target model toward hard samples at a controllable pace. Extensive experiments on 30 activity cliff datasets demonstrate that SemiMol significantly enhances graph-based ML architectures and surpasses state-of-the-art pretraining and SSL baselines.
List of keywords
Multidisciplinary Topics and Applications -> MTA: Health and medicine
Multidisciplinary Topics and Applications -> MTA: Life sciences
Humans and AI -> HAI: Applications
264
VSGT: Variational Spatial and Gaussian Temporal Graph Models for EEG-based Emotion Recognition
Chenyu Liu, Xinliang Zhou, Jiaping Xiao, Zhengri Zhu, Liming Zhai, Ziyu Jia, Yang Liu
6 min. talk | August 8th at 11:30 | Session: HAI: Cognitive modeling
Electroencephalogram (EEG), which directly reflects the emotional activity of the brain, has been increasingly utilized for emotion recognition. Most works exploit the spatial and temporal dependencies in EEG to learn emotional feature representations, but they still have two limitations to reach their full potential. First, prior knowledge is rarely used to capture the spatial dependency of brain regions. Second, the cross temporal dependency between consecutive time slices for different brain regions is ignored. To address these limitations, in this paper, we propose Variational Spatial and Gaussian Temporal (VSGT) graph models to investigate the spatial and temporal dependencies for EEG-based emotion recognition. The VSGT has two key components: Variational Spatial Encoder (VSE) and Gaussian Temporal Encoder (GTE). The VSE leverages the upper bound theorem to identify the dynamic spatial dependency based on prior knowledge by the variational Bayesian method. Besides, the GTE exploits the conditional Gaussian graph transform that computes comprehensive temporal dependency between consecutive time slices. Finally, the VSGT utilizes a recurrent structure to calculate the spatial and temporal dependencies for all time slices. Extensive experiments show the superiority of VSGT over state-of-the-art methods on multiple EEG datasets.
List of keywords
Humans and AI -> HAI: Cognitive modeling
Humans and AI -> HAI: Brain sciences
281
Probabilistically Robust Watermarking of Neural Networks
Mikhail Pautov, Nikita Bogdanov, Stanislav Pyatkin, Oleg Rogov, Ivan Oseledets
6 min. talk | August 7th at 15:00 | Session: ML: Machine Learning (4/6)
As deep learning (DL) models are widely and effectively used in Machine Learning as a Service (MLaaS) platforms, there is a rapidly growing interest in DL watermarking techniques that can be used to confirm the ownership of a particular model. Unfortunately, these methods usually produce watermarks susceptible to model stealing attacks. In our research, we introduce a novel trigger set-based watermarking approach that demonstrates resilience against functionality stealing attacks, particularly those involving extraction and distillation. Our approach does not require additional model training and can be applied to any model architecture. The key idea of our method is to compute the trigger set, which is transferable between the source model and the set of proxy models with a high probability. In our experimental study, we show that if the probability of the set being transferable is reasonably high, it can be effectively used for ownership verification of the stolen model. We evaluate our method on multiple benchmarks and show that our approach outperforms current state-of-the-art watermarking techniques in all considered experimental setups.
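A minimal sketch of trigger-set ownership verification of the kind described: if a suspect model agrees with the secret trigger labels far above chance, flag it as derived from the source model. The fixed `threshold` is an assumption; the paper gives a probabilistic guarantee instead of a hard cutoff.

```python
import numpy as np

def verify_ownership(suspect_predict, trigger_x, trigger_y, threshold=0.9):
    # Fraction of trigger inputs on which the suspect model matches the
    # secret labels that transfer from the source model.
    agreement = np.mean(suspect_predict(trigger_x) == trigger_y)
    return agreement >= threshold
```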
List of keywords
Machine Learning -> ML: Adversarial machine learning
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
Uncertainty in AI -> UAI: Applications
291
CoFInAl: Enhancing Action Quality Assessment with Coarse-to-Fine Instruction Alignment
Kanglei Zhou, Junlin Li, Ruizhi Cai, Liyuan Wang, Xingxing Zhang, Xiaohui Liang
6 min. talk | August 8th at 15:00 | Session: CV: Computer Vision (2/2)
Action Quality Assessment (AQA) is pivotal for quantifying actions across domains like sports and medical care. Existing methods often rely on pre-trained backbones from large-scale action recognition datasets to boost performance on smaller AQA datasets. However, this common strategy yields suboptimal results due to the inherent struggle of these backbones to capture the subtle cues essential for AQA. Moreover, fine-tuning on smaller datasets risks overfitting. To address these issues, we propose Coarse-to-Fine Instruction Alignment (CoFInAl). Inspired by recent advances in large language model tuning, CoFInAl aligns AQA with broader pre-trained tasks by reformulating it as a coarse-to-fine classification task. Initially, it learns grade prototypes for coarse assessment and then utilizes fixed sub-grade prototypes for fine-grained assessment. This hierarchical approach mirrors the judging process, enhancing interpretability within the AQA framework. Experimental results on two long-term AQA datasets demonstrate that CoFInAl achieves state-of-the-art performance with significant correlation gains of 5.49% and 3.55% on Rhythmic Gymnastics and Fis-V, respectively. Our code is available at https://github.com/ZhouKanglei/CoFInAl_AQA.
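A rough sketch of a two-stage prototype lookup of the kind described (the distance metric and tensor shapes are assumptions for illustration):

```python
import torch

def coarse_to_fine_assignment(feat, grade_protos, sub_protos):
    """feat: (batch, d) features; grade_protos: (G, d) coarse prototypes;
    sub_protos: (G, S, d) fixed sub-grade prototypes per coarse grade."""
    g = torch.cdist(feat, grade_protos).argmin(dim=1)            # coarse grade
    sub = sub_protos[g]                                          # (batch, S, d)
    fine = torch.cdist(feat.unsqueeze(1), sub).squeeze(1).argmin(dim=1)
    return g, fine                                               # grade, sub-grade
```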
List of keywords
Computer Vision -> CV: Action and behavior recognition
Computer Vision -> CV: Video analysis and understanding   
297
Hacking Task Confounder in Meta-Learning
Jingyao Wang, Yi Ren, Zeen Song, Jianqi Zhang, Changwen Zheng, Wenwen Qiang
6 min. talk | August 6th at 15:00 | Session: ML: Machine Learning (2/6)
Meta-learning enables rapid generalization to new tasks by learning knowledge from various tasks. It is intuitively assumed that as training progresses, a model will acquire richer knowledge, leading to better generalization performance. However, our experiments reveal an unexpected result: there is negative knowledge transfer between tasks, affecting generalization performance. To explain this phenomenon, we construct Structural Causal Models (SCMs) for causal analysis. Our investigation uncovers the presence of spurious correlations between task-specific causal factors and labels in meta-learning. Furthermore, the confounding factors differ across batches. We refer to these confounding factors as “Task Confounders”. Based on these findings, we propose a plug-and-play Meta-learning Causal Representation Learner (MetaCRL) to eliminate task confounders. It encodes decoupled generating factors from multiple tasks and utilizes an invariant-based bi-level optimization mechanism to ensure their causality for meta-learning. Extensive experiments on various benchmark datasets demonstrate that our work achieves state-of-the-art (SOTA) performance. The code is available at https://github.com/WangJingyao07/MetaCRL.
List of keywords
Machine Learning -> ML: Meta-learning
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
Machine Learning -> ML: Causality
Machine Learning -> ML: Few-shot learning
313
FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on
Chenhui Wang, Tao Chen, Zhihao Chen, Zhizhong Huang, Taoran Jiang, Qi Wang, Hongming Shan
6 min. talk | August 9th at 10:00 | Session: CV: Applications
Despite their impressive generative performance, latent diffusion model-based virtual try-on (VTON) methods lack faithfulness to crucial details of the clothes, such as style, pattern, and text. To alleviate these issues caused by the diffusion stochastic nature and latent supervision, we propose a novel Faithful Latent Diffusion Model for VTON, termed FLDM-VTON. FLDM-VTON improves the conventional latent diffusion process in three major aspects. First, we propose incorporating warped clothes as both the starting point and local condition, supplying the model with faithful clothes priors. Second, we introduce a novel clothes flattening network to constrain generated try-on images, providing clothes-consistent faithful supervision. Third, we devise a clothes-posterior sampling for faithful inference, further enhancing the model performance over conventional clothes-agnostic Gaussian sampling. Extensive experimental results on the benchmark VITON-HD and Dress Code datasets demonstrate that our FLDM-VTON outperforms state-of-the-art baselines and is able to generate photo-realistic try-on images with faithful clothing details.
List of keywords
Computer Vision -> CV: Applications
Computer Vision -> CV: Image and video synthesis and generation 
322
Bridging the Gap: Learning Pace Synchronization for Open-World Semi-Supervised Learning
Bo Ye, Kai Gan, Tong Wei, Min-Ling Zhang
12 min. talk | August 8th at 11:30 | Session: ML: Semi-supervised learning
In open-world semi-supervised learning, a machine learning model is tasked with uncovering novel categories from unlabeled data while maintaining performance on seen categories from labeled data. The central challenge is the substantial learning gap between seen and novel categories, as the model learns the former faster due to accurate supervisory information. Moreover, capturing the semantics of unlabeled novel category samples is also challenging due to the missing label information. To address the above issues, we introduce 1) the adaptive synchronizing marginal loss which imposes class-specific negative margins to alleviate the model bias towards seen classes, and 2) the pseudo-label contrastive clustering which exploits pseudo-labels predicted by the model to group unlabeled data from the same category together in the output space. Extensive experiments on benchmark datasets demonstrate that previous approaches may significantly hinder novel class learning, whereas our method strikingly balances the learning pace between seen and novel classes, achieving a remarkable 3% average accuracy increase on the ImageNet dataset. Importantly, we find that fine-tuning the self-supervised pre-trained model significantly boosts the performance, which is overlooked in prior literature. Our code is available at https://github.com/yebo0216best/LPS-main.
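A minimal sketch of a class-specific margin inside cross-entropy, the generic mechanism behind such losses; how the margins are set and adapted to synchronize the learning pace between seen and novel classes is the paper's contribution and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def margin_adjusted_ce(logits, targets, class_margins):
    """class_margins: (num_classes,) per-class margins added to the logits
    inside cross-entropy; suitably chosen (negative) margins rebalance how
    fast seen vs. novel classes are learned."""
    return F.cross_entropy(logits + class_margins.unsqueeze(0), targets)
```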
List of keywords
Machine Learning -> ML: Semi-supervised learning
Machine Learning -> ML: Weakly supervised learning
334
Explore Internal and External Similarity for Single Image Deraining with Graph Neural Networks
Cong Wang, Wei Wang, Chengjin Yu, Jie Mu
6 min. talk | August 9th at 11:30 | Session: CV: Machine learning for vision
Patch-level non-local self-similarity is an important property of natural images. However, most existing methods do not incorporate this property into neural networks for image deraining, which limits recovery performance. Motivated by this property, we find that a rainy image exhibits significant patch recurrence: similar patches tend to recur many times within the image, its multi-scale versions, and external images. To better model this property for image deraining, we develop a multi-scale graph network with exemplars, called MSGNN, that contains two branches: 1) an internal data-based supervised branch models the internal relations of similar patches from the rainy image itself and its multi-scale versions, and 2) an external data-participated unsupervised branch models the external relations between similar patches in the rainy image and the exemplar. Specifically, we construct a graph model by searching for the k-nearest neighboring patches in both the multi-scale rainy images and the exemplar. After obtaining the corresponding k neighboring patches from the multi-scale images and exemplar, we build a graph and aggregate them in an attentional manner so that the graph can provide more information from similar patches for image deraining. We embed the proposed graph in a deep neural network and train it in an end-to-end manner. Extensive experiments demonstrate that the proposed algorithm performs favorably against eight state-of-the-art methods on five public synthetic datasets and one real-world dataset. The source codes will be available at https://github.com/supersupercong/MSGNN.
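A minimal sketch of the k-nearest-neighbor patch search used to build such a graph: for each query patch, find the k most similar candidate patches (e.g., from the multi-scale images or the exemplar). Flattened patch tensors and Euclidean distance are our assumptions for illustration.

```python
import torch

def knn_patches(query, candidates, k=5):
    """query: (n_q, d) flattened query patches;
    candidates: (n_c, d) flattened candidate patches."""
    d = torch.cdist(query, candidates)               # pairwise distances (n_q, n_c)
    return d.topk(k, dim=1, largest=False).indices   # indices of k nearest patches
```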
List of keywords
Computer Vision -> CV: Applications
Computer Vision -> CV: Computational photography
349
Contrastive and View-Interaction Structure Learning for Multi-view Clustering
Jing Wang, Songhe Feng
6 min. talk | August 8th at 10:00 | Session: ML: Clustering
Existing Deep Multi-view Clustering (DMVC) approaches typically concentrate on capturing consensus semantics from multiple views, where contrastive learning is widely used to align view-specific representations of each view. Unfortunately, view-specific representations are extracted from the content information of the corresponding instance, neglecting the relationships among different instances. Furthermore, existing contrastive losses introduce numerous false negative pairs that conflict with the clustering objectives. In response to these challenges, we propose a contraStive and viEw-interaction stRucture learning framework for multI-viEw cluStering (SERIES). Our method takes into account the structural relations among instances and boosts the contrastive loss to improve intra-class compactness. Meanwhile, a cross-view dual relation generation mechanism is introduced to achieve a consensus structural graph across multiple views for clustering. Specifically, we initially acquire view-specific representations using multiple graph autoencoders to exploit both content information and structural information. Furthermore, to pull together instances of the same cluster, a soft negative-pair-aware contrastive loss is employed to distinguish dissimilar instances while attracting similar ones. Thereafter, the view-specific representations are fed into cross-view dual relation generation layers to generate each other’s affinity matrices, aiming to reveal a consistent structural graph across views. Extensive experiments conducted on six benchmarks illustrate the superiority of our method compared to other state-of-the-art approaches.
List of keywords
Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering
362
ELF-UA: Efficient Label-Free User Adaptation in Gaze Estimation
Yong Wu, Yang Wang, Sanqing Qu, Zhijun Li, Guang Chen
6 min. talk | August 6th at 15:00 | Session: CV: Transfer, low-shot, semi- and un- supervised learning   
We consider the problem of user-adaptive 3D gaze estimation. The performance of person-independent gaze estimation is limited due to interpersonal anatomical differences. Our goal is to provide a personalized gaze estimation model specifically adapted to a target user. Previous work on user-adaptive gaze estimation requires some labeled images of the target person to fine-tune the model at test time. However, this can be unrealistic in real-world applications, since it is cumbersome for an end-user to provide labeled images. In addition, previous work requires the training data to have both gaze labels and person IDs. This data requirement makes it infeasible to use some of the available data. To tackle these challenges, this paper proposes a new problem called efficient label-free user adaptation in gaze estimation. Our model only needs a few unlabeled images of a target user for model adaptation. During offline training, we have some labeled source data without person IDs and some unlabeled person-specific data. Our proposed method uses a meta-learning approach to learn how to adapt to a new user with only a few unlabeled images. Our key technical innovation is to use a generalization bound from domain adaptation to define the loss function in meta-learning, so that our method can effectively make use of both the labeled source data and the unlabeled person-specific data during training. Extensive experiments validate the effectiveness of our method on several challenging benchmarks.
List of keywords
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
Humans and AI -> HAI: Applications
366
Efficiency Calibration of Implicit Regularization in Deep Networks via Self-paced Curriculum-Driven Singular Value Selection
Zhe Li, Shuo Chen, Jian Yang, Lei Luo
6 min. talk | August 7th at 11:30 | Session: ML: Representation learning
The generalization of neural networks has been a major focus of research in deep learning. It is often interpreted as an implicit bias towards solutions with specific properties. In particular, it has been observed in practice that linear neural networks (LNNs) tend to favor low-rank solutions for matrix completion tasks. However, most existing methods rely on increasing the depth of the neural network to strengthen the low-rank bias of solutions, resulting in higher complexity. In this paper, we propose a new explicit regularization method that calibrates the implicit bias towards low-rank solutions in matrix completion tasks. Our approach automatically incorporates smaller singular values into the training process using a self-paced learning strategy, gradually restoring matrix information. By jointly using both implicit and explicit regularization, we effectively capture the low-rank structure of the LNN and accelerate its convergence. We also analyze how our proposed penalty term interacts with implicit regularization and provide theoretical guarantees for our new model. To evaluate the effectiveness of our method, we conduct a series of experiments on both simulated and real-world data. Our experimental results clearly demonstrate that our method has better robustness and generalization ability compared with other methods.
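One way to picture the self-paced singular value selection is a truncated-nuclear-norm-style penalty whose coverage of the smallest singular values grows on a schedule. The sketch below is a hedged illustration under that assumption; the paper's exact selection rule and penalty form may differ.

```python
import torch

def self_paced_tail_penalty(W, step, total_steps, lam=1e-3):
    # W: the factor / completed matrix being trained
    s = torch.linalg.svdvals(W)                  # singular values, descending
    frac = min(1.0, (step + 1) / total_steps)    # self-paced pace parameter
    tail = max(1, int(frac * s.numel()))         # how many small values join
    return lam * s[-tail:].sum()                 # penalize the smallest values
```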
List of keywords
Machine Learning -> ML: Representation learning
Data Mining -> DM: Recommender systems
Machine Learning -> ML: Theory of deep learning
376
Higher-Order Argumentation Frameworks: Principles and Gradual Semantics
Leila Amgoud, Dragan Doder, Marie-Christine Lagasquie-Schiex
12 min. talk | August 6th at 15:00 | Session: KRR: Argumentation
The paper investigates how to evaluate elements in complex argumentation frameworks, where both arguments and attacks are weighted and might be attacked by arguments. We propose the first gradual semantics that assign a numerical value to every argument and attack. The value represents the acceptance (seriousness) degree of an argument (attack). We start by highlighting various technical challenges facing semantics in such complex settings, including how to deal with attacks vs arguments, and how to combine their values. We present principles that describe different strategies offered to semantics to meet such challenges. Then, we introduce various semantics per strategy. For instance, some semantics evaluate attacks and arguments in the same way while others, called hybrid, treat them differently. Finally, the principles are used to compare the plethora of novel semantics. The final result is a catalogue of semantics with different formal guarantees and behaviours.
List of keywords
Knowledge Representation and Reasoning -> KRR: Argumentation
Knowledge Representation and Reasoning -> KRR: Common-sense reasoning
387
InfoMatch: Entropy Neural Estimation for Semi-Supervised Image Classification
Qi Han, Zhibo Tian, Chengwei Xia, Kun Zhan
6 min. talk | August 8th at 11:30 | Session: ML: Semi-supervised learning
Semi-supervised image classification, leveraging pseudo supervision and consistency regularization, has demonstrated remarkable success. However, the ongoing challenge lies in fully exploiting the potential of unlabeled data. To address this, we employ information entropy neural estimation to exploit the potential of unlabeled samples. Inspired by contrastive learning, the entropy is estimated by maximizing a lower bound on the mutual information across different augmented views. Moreover, we show theoretically that the information entropy of the classifier's posterior can be approximated by maximizing the likelihood of its softmax predictions. Guided by these insights, we optimize our model from both perspectives to ensure that the predicted probability distribution closely aligns with the ground-truth distribution. Given the theoretical connection to information entropy, we name our method InfoMatch. Through extensive experiments, we show its superior performance. The source code is available at https://github.com/kunzhan/InfoMatch.
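The mutual-information lower bound referenced here is, in spirit, the InfoNCE bound I(X1; X2) >= log N - CE. A minimal sketch of estimating it across two augmented views follows; the shapes and temperature `tau` are illustrative assumptions, not the paper's exact objective.

```python
import math
import torch
import torch.nn.functional as F

def infonce_lower_bound(h1, h2, tau=0.1):
    # h1, h2: (N, D) embeddings of two augmented views of the same N images
    h1, h2 = F.normalize(h1, dim=1), F.normalize(h2, dim=1)
    logits = h1 @ h2.t() / tau            # (N, N); matching views on the diagonal
    labels = torch.arange(len(h1))        # positives are the diagonal entries
    # InfoNCE: I(X1; X2) >= log N - CE(logits, labels); return the bound itself
    return math.log(len(h1)) - F.cross_entropy(logits, labels)
```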
List of keywords
Machine Learning -> ML: Semi-supervised learning
Machine Learning -> ML: Self-supervised Learning
Machine Learning -> ML: Unsupervised learning
Computer Vision -> CV: Representation learning
395
Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction
Zhaoxi Mu, Xinyu Yang
6 min. talk | August 8th at 10:00 | Session: NLP: Speech
The integration of visual cues has revitalized the performance of the target speech extraction task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm often encounters the challenge of modality imbalance. In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance. To tackle this issue, we propose AVSepChain, drawing inspiration from the speech chain concept. Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production. In the speech perception stage, audio serves as the dominant modality, while visual information acts as the conditional modality. Conversely, in the speech production stage, the roles are reversed. This transformation of modality status aims to alleviate the problem of modality imbalance. Additionally, we introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with the semantic information conveyed by lip movements during the speech production stage. Through extensive experiments conducted on multiple benchmark datasets for audio-visual target speech extraction, we showcase the superior performance achieved by our proposed method.
List of keywords
Natural Language Processing -> NLP: Speech
Machine Learning -> ML: Multi-modal learning
404
Scalable Federated Unlearning via Isolated and Coded Sharding
Yijing Lin, Zhipeng Gao, Hongyang Du, Dusit Niyato, Gui Gui, Shuguang Cui, Jinke Ren
6 min. talk | August 6th at 15:00 | Session: ML: Federated learning (1/2)
Federated unlearning has emerged as a promising paradigm to erase the client-level data effect without affecting the performance of collaborative learning models. However, the federated unlearning process often introduces extensive storage overhead and consumes substantial computational resources, thus hindering its implementation in practice. To address this issue, this paper proposes a scalable federated unlearning framework based on isolated sharding and coded computing. We first divide distributed clients into multiple isolated shards across stages to reduce the number of clients being affected. Then, to reduce the storage overhead of the central server, we develop a coded computing mechanism by compressing the model parameters across different shards. In addition, we provide the theoretical analysis of time efficiency and storage effectiveness for the isolated and coded sharding. Finally, extensive experiments on two typical learning tasks, i.e., classification and generation, demonstrate that our proposed framework can achieve better performance than three state-of-the-art frameworks in terms of accuracy, retraining time, storage overhead, and F1 scores for resisting membership inference attacks.
List of keywords
Machine Learning -> ML: Federated learning
Machine Learning -> ML: Trustworthy machine learning
406
AllMatch: Exploiting All Unlabeled Data for Semi-Supervised Learning
Zhiyu Wu, Jinshi Cui
12 min. talk | August 8th at 11:30 | Session: ML: Semi-supervised learning
Existing semi-supervised learning algorithms adopt pseudo-labeling and consistency regularization techniques to introduce supervision signals for unlabeled samples. To overcome the inherent limitation of threshold-based pseudo-labeling, prior studies have attempted to align the confidence threshold with the evolving learning status of the model, which is estimated through the predictions made on the unlabeled data. In this paper, we further reveal that classifier weights can reflect the differentiated learning status across categories and consequently propose a class-specific adaptive threshold mechanism. Additionally, considering that even the optimal threshold scheme cannot resolve the problem of discarding unlabeled samples, a binary classification consistency regularization approach is designed to distinguish candidate classes from negative options for all unlabeled samples. By combining the above strategies, we present a novel SSL algorithm named AllMatch, which achieves improved pseudo-label accuracy and a 100% utilization ratio for the unlabeled data. We extensively evaluate our approach on multiple benchmarks, encompassing both balanced and imbalanced settings. The results demonstrate that AllMatch consistently outperforms existing state-of-the-art methods.
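A hedged sketch of how classifier weights might drive class-specific thresholds: per-class weight norms are read as learning status and rescale a global confidence threshold. The exact mapping used by AllMatch may differ; `base_tau` and the helper names are assumptions.

```python
import torch

def class_specific_thresholds(classifier_weight, base_tau=0.95):
    # classifier_weight: (num_classes, D) final-layer weight matrix
    norms = classifier_weight.norm(dim=1)       # per-class weight norms
    status = norms / norms.max()                # in [0, 1]: 1 = best learned
    return base_tau * status                    # well-learned classes get
                                                # stricter thresholds

def pseudo_label_mask(probs, thresholds):
    conf, cls = probs.max(dim=1)                # confidence and argmax class
    return conf >= thresholds[cls]              # accept per-class
```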
List of keywords
Machine Learning -> ML: Semi-supervised learning
418
IntensPure: Attack Intensity-aware Secondary Domain Adaptive Diffusion for Adversarial Purification
Eun-Gi Lee, Moon Seok Lee, Jae Hyun Yoon, Seok Bong Yoo
6 min. talk | August 7th at 11:30 | Session: CV: Adversarial learning, adversarial attack and defense methods
Adversarial attacks pose a severe threat to the accuracy of person re-identification (re-ID) systems, a critical security technology. Adversarial purification methods are promising approaches for defending against comprehensive attacks, including unseen ones. However, re-ID testing identities (IDs) are unseen, requiring more sophisticated purification than other classification tasks for adversarial defense. We propose IntensPure, an adversarial purification method in person re-ID that quantifies attack intensity via ID stability and attribute inconsistency to customize purification strength. Based on the estimated attack intensity, IntensPure employs secondary domain adaptive diffusion focused on purifying the low- and mid-frequency coefficients vulnerable to re-ID attacks. This method significantly reduces computational costs compared to the conventional diffusion method. For elaborate purification, IntensPure performs a directional diffusion process and refinements, leveraging the directional characteristics of secondary images. The experimental results on diverse attacks demonstrate that IntensPure outperforms the existing methods in terms of rank-1 accuracy.
List of keywords
Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Recognition (object detection, categorization)
434
Efficient Tuning and Inference for Large Language Models on Textual Graphs
Yun Zhu, Yaoke Wang, Haizhou Shi, Siliang Tang
6 min. talk | August 8th at 15:00 | Session: ML: Sequence and graph learning
Rich textual and topological information of textual graphs needs to be modeled in real-world applications such as webpages, e-commerce, and academic articles. Practitioners have long followed the path of adopting a shallow text encoder and a subsequent graph neural network (GNN) to solve this problem. In light of recent advancements in large language models (LLMs), it is apparent that integrating LLMs for enhanced textual encoding can substantially improve performance on textual graphs. Nevertheless, the efficiency of these methods poses a significant challenge. In this paper, we propose ENGINE, a parameter- and memory-efficient fine-tuning method for textual graphs with an LLM encoder. The key insight is to combine the LLM and GNN through a tunable side structure, which significantly reduces the training complexity without impairing the joint model's capacity. Extensive experiments on textual graphs demonstrate our method's effectiveness: it achieves the best model performance while having the lowest training cost compared to previous methods. Moreover, we introduce two variants with caching and dynamic early exit to further enhance training and inference speed. Specifically, caching accelerates ENGINE's training by 12x, and dynamic early exit achieves up to 5x faster inference with a negligible performance drop (at most a 1.17% relative drop across 7 datasets). Our code is available at: https://github.com/ZhuYun97/ENGINE.
List of keywords
Machine Learning -> ML: Sequence and graph learning
Data Mining -> DM: Mining graphs
Natural Language Processing -> NLP: Language models
Machine Learning -> ML: Supervised Learning
439
Exploiting Multi-Label Correlation in Label Distribution Learning
Zhiqiang Kou, Jing Wang, Jiawei Tang, Yuheng Jia, Boyu Shi, Xin Geng
12 min. talk | August 9th at 11:30 | Session: ML: Multi-label learning
Label Distribution Learning (LDL) is a novel machine learning paradigm that assigns a label distribution to each instance. Numerous LDL methods have been proposed to leverage label correlation in the learning process and cope with the exponential-sized output space; among these, many exploit the low-rank structure of label distributions to capture label correlation. However, recent research has unveiled that label distribution matrices typically maintain full rank, posing a challenge to approaches relying on low-rank label correlation. Notably, low-rank label correlation finds widespread adoption in the multi-label learning (MLL) literature due to the often low-rank nature of multi-label matrices. Inspired by this, we introduce an auxiliary MLL process within the LDL framework, capturing low-rank label correlation within this auxiliary MLL component rather than in the LDL itself. By doing so, we adeptly exploit low-rank label correlation in our LDL methods. We conduct comprehensive experiments and demonstrate that our methods are superior to existing LDL methods. Besides, the ablation studies justify the advantages of exploiting low-rank label correlation in the auxiliary MLL.
List of keywords
Machine Learning -> ML: Multi-label learning
Machine Learning -> ML: Applications
Machine Learning -> ML: Classification
440
Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition
Zuan Gao, Yuxin Wang, Yadong Qu, Boqiang Zhang, Zixiao Wang, Jianjun Xu, Hongtao Xie
6 min. talk | August 7th at 15:00 | Session: CV: Recognition (object detection, categorization)
In text recognition, self-supervised pre-training emerges as a good solution to reduce dependence on expensive annotated real data. Previous studies primarily focus on local visual representation by leveraging mask image modeling or sequence contrastive learning. However, they neglect the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct the direction-specific pixel and feature signals from the symmetrically superimposed input. Specifically, we add the original image to its inverted views to create the symmetrically superimposed inputs. At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the features of the same original image and inverted image under different augmentations to model the semantic-level linguistic context and local character discrimination. By design, the superimposition disrupts character shapes and linguistic rules, so the dual-level reconstruction facilitates understanding character shapes and linguistic information from the perspective of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and 86.6% new state-of-the-art average word accuracy on Union14M benchmarks. The code is available at https://github.com/FaltingsA/SSM.
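The symmetric superimposition input itself is simple to state. The sketch below, assuming "inversion" means a horizontal-plus-vertical flip and an equal-weight blend, builds the superimposed input and the two reconstruction targets; the exact blending used in the paper may differ.

```python
import torch

def superimpose(img):
    # img: (B, C, H, W); invert by flipping both spatial axes
    inverted = torch.flip(img, dims=(2, 3))
    mixed = 0.5 * (img + inverted)      # symmetrically superimposed input
    return mixed, img, inverted         # input, two reconstruction targets
```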
List of keywords
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Representation learning
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
465
EvaNet: Elevation-Guided Flood Extent Mapping on Earth Imagery
Mirza Tanzim Sami, Da Yan, Saugat Adhikari, Lyuheng Yuan, Jiao Han, Zhe Jiang, Jalal Khalil, Yang Zhou
12 min. talk | August 6th at 11:30 | Session: CV: 3D computer vision (1/2)
Accurate and timely mapping of flood extent from high-resolution satellite imagery plays a crucial role in disaster management, such as damage assessment and relief activities. However, current state-of-the-art solutions are based on U-Net, which cannot segment flood pixels accurately due to ambiguous pixels (e.g., tree canopies, clouds) that prevent a direct judgement from spectral features alone. Thanks to the digital elevation model (DEM) data readily available from sources such as the United States Geological Survey (USGS), this work explores the use of an elevation map to improve flood extent mapping. We propose EvaNet, an elevation-guided segmentation model based on the encoder-decoder architecture with two novel techniques: (1) a loss function encoding the physical law of gravity: if a location is flooded (resp. dry), then its adjacent locations with a lower (resp. higher) elevation must also be flooded (resp. dry); (2) a new (de)convolution operation that integrates the elevation map via a location-sensitive gating mechanism to regulate how much spectral features flow through adjacent layers. Extensive experiments show that EvaNet significantly outperforms the U-Net baselines and works as a perfect drop-in replacement for U-Net in existing solutions to flood extent mapping. EvaNet is open-sourced at https://github.com/MTSami/EvaNet.
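The gravity-law loss can be pictured as a pairwise penalty over neighboring pixels: a pixel predicted flooded whose lower-elevation neighbor is predicted dry violates the constraint. The sketch below is one hedged instantiation (hinge form, right/down neighbors, border wraparound ignored for brevity), not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def gravity_loss(prob_flood, elevation):
    # prob_flood, elevation: (B, 1, H, W)
    loss = 0.0
    for shift in ((0, 1), (1, 0)):                 # right and down neighbors
        p = prob_flood
        pn = torch.roll(prob_flood, shifts=shift, dims=(2, 3))
        e = elevation
        en = torch.roll(elevation, shifts=shift, dims=(2, 3))
        lower = (en < e).float()                   # neighbor sits lower
        # a flooded pixel (p) with a lower, dry neighbor (pn) violates gravity
        loss = loss + (lower * F.relu(p - pn)).mean()
    return loss
```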
List of keywords
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Applications
Computer Vision -> CV: Segmentation
Data Mining -> DM: Mining spatial and/or temporal data
478
Dual Enhancement in ODI Super-Resolution: Adapting Convolution and Upsampling to Projection Distortion
Xiang Ji, Changqiao Xu, Lujie Zhong, Shujie Yang, Han Xiao, Gabriel-Miro Muntean
6 min. talk | August 6th at 11:30 | Session: CV: 3D computer vision (1/2)
Omnidirectional images (ODIs) demand considerably higher resolution to ensure high quality across all viewports. Traditional convolutional neural network (CNN)-based single-image super-resolution (SISR) networks, however, are not effective for spherical ODIs. This is due to the uneven pixel density distribution and varying texture complexity in different regions that arise when projecting from a sphere to a plane. Additionally, the computational and memory costs associated with large-sized ODIs present a challenge for real-world applications. To address these issues, we propose an efficient distortion-adaptive super-resolution network (ODA-SRN). Specifically, ODA-SRN employs a series of specially designed Distortion Attention Block Groups (DABG) as its backbone. Our Distortion Attention Blocks (DABs) utilize multi-segment parameterized convolution to generate dynamic filters, which compensate for distortion and texture fading during feature extraction. Moreover, we introduce an upsampling scheme that accounts for the dependence between pixel position and distortion degree to achieve pixel-level distortion offset. A comprehensive set of results demonstrates that our ODA-SRN significantly improves super-resolution performance for ODIs, both quantitatively and qualitatively, when compared to other state-of-the-art methods.
List of keywords
Computer Vision -> CV: 3D computer vision
Agent-based and Multi-agent Systems -> MAS: Human-agent interaction
Computer Vision -> CV: Applications
Computer Vision -> CV: Structural and model-based approaches, knowledge representation and reasoning
479
Structure-Aware Spatial-Temporal Interaction Network for Video Shadow Detection
Housheng Wei, Guanyu Xing, Jingwei Liao, Yanci Zhang, Yanli Liu
6 min. talk | August 7th at 15:00 | Session: CV: Recognition (object detection, categorization)
Video shadow detection faces significant challenges due to ambiguous semantics and variable shapes. Existing video shadow detection algorithms typically overlook the fine shadow details, resulting in inconsistent detection between consecutive frames in complex real-world video scenarios. To address this issue, we propose a spatial-temporal feature interaction strategy, which refines and enhances global shadow semantics with local prior features in the modeling of shadow relations between frames. Moreover, a structure-aware shadow prediction module is proposed, which focuses on modeling the distance relation between local shadow edges and regions. Quantitative experimental results demonstrate that our approach significantly outperforms the state-of-the-art methods, providing stable and consistent shadow detection results in complex video shadow scenarios.
List of keywords
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Scene analysis and understanding   
Computer Vision -> CV: Segmentation
Computer Vision -> CV: Video analysis and understanding   
481
Self-Supervised Monocular Depth Estimation in the Dark: Towards Data Distribution Compensation
Haolin Yang, Chaoqiang Zhao, Lu Sheng, Yang Tang
6 min. talk | August 6th at 15:00 | Session: CV: 3D computer vision (2/2)
Nighttime self-supervised monocular depth estimation has received increasing attention in recent years. However, using night images for self-supervision is unreliable because the photometric consistency assumption is usually violated in videos taken under complex lighting conditions. Even with domain adaptation or photometric loss repair, performance is still limited by the poor supervision that night images provide to trainable networks. In this paper, we propose a self-supervised nighttime monocular depth estimation method that does not use any night images during training. Our framework uses day images as a stable source of self-supervision and applies physical priors (e.g., wave optics, a reflection model, and a read-shot noise model) to compensate for key day-night differences. With day-to-night data distribution compensation, our framework can be trained in an efficient one-stage self-supervised manner. Although no nighttime images are seen during training, qualitative and quantitative results demonstrate that our method achieves SoTA depth estimation results on the challenging nuScenes-Night and RobotCar-Night benchmarks compared with existing methods.
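As a rough picture of the read-shot noise prior mentioned above, the sketch below darkens a day image and adds signal-dependent shot noise plus signal-independent read noise; the gain and noise scales are illustrative assumptions, not the paper's calibrated model.

```python
import torch

def day_to_night(img, gain=0.2, shot=0.01, read=0.02):
    # img: (B, C, H, W) in [0, 1]
    dark = img * gain                                    # global illumination drop
    noisy = (dark
             + torch.randn_like(dark) * (shot * dark).sqrt()  # shot noise ~ signal
             + torch.randn_like(dark) * read)                 # read noise, constant
    return noisy.clamp(0.0, 1.0)
```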
List of keywords
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Other
Computer Vision -> CV: Scene analysis and understanding   
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
485
SwiftThief: Enhancing Query Efficiency of Model Stealing by Contrastive Learning
Jeonghyun Lee, Sungmin Han, Sangkyun Lee
6 min. talk | August 7th at 10:00 | Session: ETF: Safety and robustness
Model-stealing attacks are emerging as a severe threat to AI-based services because an adversary can create models that duplicate the functionality of black-box AI models inside the services with regular query-based access. To avoid detection or query costs, a model-stealing adversary must minimize the number of queries needed to obtain an accurate clone model. To achieve this goal, we propose SwiftThief, a novel model-stealing framework that utilizes both queried and unqueried data to reduce query complexity. In particular, SwiftThief uses contrastive learning, a recent technique for representation learning. We formulate a new objective function for model stealing consisting of self-supervised (for abundant unqueried inputs from public datasets) and soft-supervised (for queried inputs) contrastive losses, jointly optimized with an output matching loss (for queried inputs). In addition, we suggest a new sampling strategy that prioritizes rarely queried classes to improve attack performance. Our experiments show that SwiftThief significantly enhances the efficiency of model-stealing attacks compared to existing methods, achieving similar attack performance using only half the query budget of competing approaches. SwiftThief also remains highly effective even when the victim deploys a defense.
List of keywords
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
Multidisciplinary Topics and Applications -> MTA: Security and privacy
486
PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation
Deyi Ji, Wenwei Jin, Hongtao Lu, Feng Zhao
6 min. talk | August 9th at 11:30 | Session: CV: Computer Vision (1/2)
The ascension of Unmanned Aerial Vehicles (UAVs) in various fields necessitates effective UAV image segmentation, which faces challenges due to the dynamic perspectives of UAV-captured images. Traditional segmentation algorithms falter as they cannot accurately mimic the complexity of UAV perspectives, and the cost of obtaining multi-perspective labeled datasets is prohibitive. To address these issues, we introduce the PPTFormer, a novel Pseudo Multi-Perspective Transformer network that revolutionizes UAV image segmentation. Our approach circumvents the need for actual multi-perspective data by creating pseudo perspectives for enhanced multi-perspective learning. The PPTFormer network boasts Perspective Decomposition, novel Perspective Prototypes, and a specialized encoder and decoder that together achieve superior segmentation results through Pseudo Multi-Perspective Attention (PMP Attention) and fusion. Our experiments demonstrate that PPTFormer achieves state-of-the-art performance across five UAV segmentation datasets, confirming its capability to effectively simulate UAV flight perspectives and significantly advance segmentation precision. This work presents a pioneering leap in UAV scene understanding and sets a new benchmark for future developments in semantic segmentation.
List of keywords
Computer Vision -> CV: Scene analysis and understanding   
Computer Vision -> CV: Segmentation
496
Online Submodular Maximization via Adaptive Thresholds
Zhengchen Yang, Jiping Zheng
12 min. talk | August 7th at 11:30 | Session: S: Combinatorial search and optimisation
Submodular function maximization has been studied extensively in recent years due to its numerous applications in machine learning and artificial intelligence. We study a natural online variant of this problem on massive streaming data, in which elements arrive one by one and the algorithm has to maintain a solution under a cardinality constraint k. Upon arrival of an element, the algorithm, which aims to maximize a monotone submodular function, has to decide whether to accept the element and may replace a previously chosen one. Existing algorithms cannot simultaneously achieve optimal performance in terms of competitive ratio, memory complexity, and running time; moreover, the algorithm with the best competitive ratio performs poorly in practice. In this paper, we propose a new algorithm, OnlineAdaptive, with optimal performance, which exploits adaptive thresholds to decide whether to accept an arriving element by replacement. We prove that the competitive ratio of OnlineAdaptive is at least 1/4; the ratio is about 0.2959 when k >= 4 and approaches 0.3178 as k tends to infinity. In addition, OnlineAdaptive needs only O(k) memory and performs just one oracle call per element. Experiments on diverse datasets confirm that OnlineAdaptive outperforms existing algorithms in both quality and efficiency.
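The family of algorithms this belongs to is easy to sketch: keep at most k elements, and once the buffer is full, swap in an arriving element only if the gain of the best swap clears an adaptive threshold. The rule below (gain >= c·f(S)/k) and the k-evaluation swap search are illustrative simplifications; OnlineAdaptive itself needs just one oracle call per element.

```python
def streaming_with_replacement(stream, f, k, c=1.0):
    # f: a monotone submodular set function taking a list of elements
    S = []
    for e in stream:
        if len(S) < k:
            S.append(e)                                   # fill up first
            continue
        base = f(S)
        # find the best element to evict if e were swapped in
        gains = [f(S[:i] + S[i + 1:] + [e]) - base for i in range(k)]
        i_best = max(range(k), key=lambda i: gains[i])
        if gains[i_best] >= c * base / k:                 # adaptive threshold
            S[i_best] = e                                 # replace
    return S
```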
List of keywords
Search -> S: Combinatorial search and optimisation
Search -> S: Heuristic search
509
EVE: Efficient Zero-Shot Text-Based Video Editing With Depth Map Guidance and Temporal Consistency Constraints
Yutao Chen, Xingning Dong, Tian Gan, Chunluan Zhou, Ming Yang, Qingpei Guo
6 min. talk | August 8th at 11:30 | Session: CV: Image and video synthesis and generation (1/2)
Motivated by the superior performance of image diffusion models, more and more researchers strive to extend these models to the text-based video editing task. Nevertheless, current video editing methods mainly suffer from the dilemma between high fine-tuning cost and limited generation capacity. Compared with images, we conjecture that videos necessitate more constraints to preserve temporal consistency during editing. Towards this end, we propose EVE, a robust and Efficient zero-shot Video Editing method. Under the guidance of depth maps and temporal consistency constraints, EVE derives satisfactory video editing results at an affordable computational and time cost. Moreover, recognizing the absence of a publicly available video editing dataset for fair comparisons, we construct a new benchmark named the ZVE-50 dataset. Through comprehensive experimentation, we validate that EVE achieves a satisfactory trade-off between performance and efficiency. The codebase, datasets, and video editing demos are available at https://github.com/alipay/Ant-Multi-Modal-Framework/blob/main/prj/EVE.
List of keywords
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Applications
524
OD-DETR: Online Distillation for Stabilizing Training of Detection Transformer
Shengjian Wu, Li Sun, Qingli Li
6 min. talk | August 7th at 15:00 | Session: CV: Recognition (object detection, categorization)
DEtection TRansformer (DETR) has become a dominant paradigm, mainly due to its common architecture with high accuracy and no post-processing. However, DETR suffers from unstable training dynamics: it consumes more data and epochs to converge than CNN-based detectors. This paper aims to stabilize DETR training through online distillation. It utilizes a teacher model, accumulated by an Exponential Moving Average (EMA), and distills its knowledge into the online model in the following three aspects. First, the matching relation between object queries and ground-truth (GT) boxes in the teacher is employed to guide the student, so queries within the student are not only assigned labels based on their own predictions but also refer to the matching results from the teacher. Second, the teacher's initial query is given to the online student, and its prediction is directly constrained by the corresponding output from the teacher. Finally, the object queries from the teacher's different decoding stages are used to build auxiliary groups to accelerate convergence. For each GT, the two queries with the least matching costs are selected into this extra group, where they predict the GT box and participate in the optimization. Extensive experiments show that the proposed OD-DETR successfully stabilizes training and significantly increases performance without bringing in more parameters.
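The EMA-accumulated teacher is a standard construction and can be sketched directly; `decay` is an assumed hyperparameter, and this is a generic illustration rather than the paper's code.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # move each teacher parameter a small step toward the online student
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(decay).add_(ps, alpha=1.0 - decay)
    # keep non-trainable state (e.g., BatchNorm statistics) in sync
    for bt, bs in zip(teacher.buffers(), student.buffers()):
        bt.copy_(bs)
```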
List of keywords
Computer Vision -> CV: Recognition (object detection, categorization)
549
MLP-DINO: Category Modeling and Query Graphing with Deep MLP for Object Detection
Guiping Cao, Wenjian Huang, Xiangyuan Lan, Jianguo Zhang, Dongmei Jiang, Yaowei Wang
6 min. talk | August 7th at 15:00 | Session: CV: Recognition (object detection, categorization)
Popular transformer-based detectors detect objects in a one-to-one manner, where both the bounding box and category of each object are predicted by a single query, leading to box-sensitive category predictions. Additionally, initializing positional queries solely based on predicted confidence scores or learnable embeddings neglects the significant spatial interrelation between different queries. This oversight leads to an imbalanced spatial distribution of queries (SDQ). In this paper, we propose a new MLP-DINO model to address these issues. First, we present a new Query-Independent Category Supervision (QICS) approach for modeling category information, decoupling it from the sensitive bounding box prediction process to improve detection performance. Second, to further improve the category predictions, we introduce a deep MLP model into the transformer-based detection framework to capture long-range and short-range information simultaneously. Third, to balance the SDQ, we design a novel Graph-based Query Selection (GQS) method that distributes each query point in a discrete manner by graphing the spatial information of queries to cover a broader range of potential objects, significantly enhancing the hit rate of queries. Experimental results on COCO indicate that our MLP-DINO achieves 54.6% AP with only 44M parameters under the 36-epoch setting, greatly outperforming the original DINO by +3.7% AP with fewer parameters and FLOPs. The source code will be available at https://github.com/Med-Process/MLP-DINO.
List of keywords
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Scene analysis and understanding   
Machine Learning -> ML: Multi-label learning
554
Feature Norm Regularized Federated Learning: Utilizing Data Disparities for Model Performance Gains
Ke Hu, Liyao Xiang, Peng Tang, Weidong Qiu
6 min. talk | August 8th at 10:00 | Session: ML: Federated learning (2/2)
Federated learning (FL) is a machine learning paradigm that aggregates knowledge and utilizes computational power from multiple participants to train a global model. However, a commonplace challenge, non-independent and identically distributed (non-i.i.d.) data across participants, can lead to significant divergence in model updates, thus diminishing training efficacy. In this paper, we propose the Feature Norm Regularized Federated Learning (FNR-FL) algorithm to tackle the non-i.i.d. challenge. FNR-FL incorporates class-average feature norms into the loss function via a straightforward yet effective regularization strategy. The core idea of FNR-FL is to penalize the deviations in the update directions of local models caused by non-i.i.d. data. Theoretically, we provide convergence guarantees for FNR-FL when training under non-i.i.d. scenarios. Practically, our comprehensive experimental evaluations demonstrate that FNR-FL significantly outperforms existing FL algorithms in terms of test accuracy and maintains a competitive convergence rate with lower communication overhead and shorter duration. Compared to FedAvg, FNR-FL exhibits a 66.24% improvement in accuracy and an 11.40% reduction in training time, underscoring its enhanced effectiveness and efficiency. The code is available on GitHub at: https://github.com/LonelyMoonDesert/FNR-FL.
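A hedged sketch of a class-average feature norm regularizer in this spirit: on each client, per-class mean feature norms are pulled toward reference norms (e.g., from the global model). The exact aggregation used in FNR-FL may differ; the function name and signature are assumptions.

```python
import torch

def feature_norm_regularizer(features, labels, ref_norms, num_classes):
    # features: (N, D) penultimate activations on a client's local batch
    # ref_norms: (num_classes,) reference class-average feature norms
    reg = features.new_zeros(())
    for c in range(num_classes):
        idx = labels == c
        if idx.any():
            avg_norm = features[idx].norm(dim=1).mean()   # class-average norm
            reg = reg + (avg_norm - ref_norms[c]) ** 2    # penalize the drift
    return reg / num_classes
```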
List of keywords
Machine Learning -> ML: Federated learning
Machine Learning -> ML: Optimization
Machine Learning -> ML: Robustness
Machine Learning -> ML: Supervised Learning
564
Towards Counterfactual Fairness-aware Domain Generalization in Changing Environments
Yujie Lin, Chen Zhao, Minglai Shao, Baoluo Meng, Xujiang Zhao, Haifeng Chen
6 min. talk | August 8th at 15:00 | Session: ML: Time series and data streams
Domain generalization is a commonplace challenge in machine learning: in practical scenarios, the data distribution may progressively evolve across a continuum of sequential domains. While current methodologies primarily concentrate on bolstering model effectiveness within these new domains, they tend to neglect fairness throughout the learning process. In response, we propose an innovative framework called Disentanglement for Counterfactual Fairness-aware Domain Generalization (DCFDG). This approach adeptly removes domain-specific and sensitive information from the embedded representation of classification features. To scrutinize the intricate interplay between semantic information, domain-specific information, and sensitive attributes, we systematically partition the exogenous factors into four latent variables. By incorporating fairness regularization, we utilize semantic information exclusively for classification purposes. Empirical validation on synthetic and authentic datasets substantiates the efficacy of our approach, demonstrating elevated accuracy levels while ensuring the preservation of fairness amidst the evolving landscape of continuous domains.
List of keywords
Machine Learning -> ML: Time series and data streams
AI Ethics, Trust, Fairness -> ETF: Fairness and diversity
Machine Learning -> ML: Causality
Machine Learning -> ML: Generative models
567
Class-Specific Semantic Generation and Reconstruction Learning for Open Set Recognition
Liu Haoyang, Yaojin Lin, Peipei Li, Jun Hu, Xuegang Hu
6 min. talk | August 7th at 15:00 | Session: DM: Data Mining (1/2)
Open set recognition is a crucial research theme for open-environment machine learning. A common solution for this problem is to learn compact representations of known classes and identify unknown samples by measuring deviations from these known classes. However, the aforementioned methods (1) lack open training consideration, which is detrimental to the fitting of known classes, and (2) recognize unknown classes on an inadequate basis, which limits the accuracy of recognition. In this study, we propose an open reconstruction learning framework that learns a union boundary region of known classes to characterize unknown space. This facilitates the isolation of known space from unknown space to represent known classes compactly and provides a more reliable recognition basis from the perspective of both known and unknown space. Specifically, an adversarial constraint is used to generate class-specific boundary samples. Then, the known classes and approximate unknown space are fitted with manifolds represented by class-specific auto-encoders. Finally, the auto-encoders output the reconstruction error in terms of known and unknown spaces to recognize samples. Extensive experimental results show that the proposed method outperforms existing advanced methods and achieves new state-of-the-art performance. The code is available at https://github.com/Ashowman98/CSGRL.
List of keywords
Data Mining -> DM: Other
Data Mining -> DM: Anomaly/outlier detection
577
Learning with Posterior Sampling for Revenue Management under Time-varying Demand
Kazuma Shimizu, Junya Honda, Shinji Ito, Shinji Nakadai
6 min. talk | August 8th at 15:00 | Session: ML: Machine Learning (6/6)
This paper discusses the revenue management (RM) problem of maximizing revenue by pricing items or services. One challenge in this problem is that the demand distribution is unknown and varies over time in real applications such as the airline and retail industries. In particular, time-varying demand has not been well studied under scenarios of unknown demand due to the difficulty of jointly managing the remaining inventory and estimating the demand. To tackle this challenge, we first introduce an episodic generalization of the RM problem motivated by typical application scenarios. We then propose a computationally efficient algorithm based on posterior sampling, which effectively optimizes prices by solving a linear program. We derive a Bayesian regret upper bound for this algorithm for general models in which demand parameters can be correlated between time periods, and we also derive a regret lower bound for generic algorithms. Our empirical study shows that the proposed algorithm performs better than other benchmark algorithms and comparably to the optimal policy in hindsight. We also propose a heuristic modification of the proposed algorithm, which further efficiently learns the pricing policy in the experiments. An extended version of this paper with appendices is available at: http://arxiv.org/abs/2405.04910.
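Stripped of the inventory constraints and the LP, the posterior-sampling skeleton looks like classic Thompson sampling over prices. The sketch below assumes Bernoulli demand with Beta priors and replaces the paper's LP step with a naive argmax; it illustrates the loop, not the algorithm itself.

```python
import numpy as np

def thompson_pricing(prices, true_buy_prob, horizon, seed=0):
    prices = np.asarray(prices, dtype=float)
    rng = np.random.default_rng(seed)
    a = np.ones(len(prices))                       # Beta(1, 1) priors per price
    b = np.ones(len(prices))
    revenue = 0.0
    for _ in range(horizon):
        theta = rng.beta(a, b)                     # sample demand parameters
        i = int(np.argmax(prices * theta))         # act optimally vs the sample
        sold = rng.random() < true_buy_prob[i]     # observe (simulated) demand
        a[i] += sold                               # posterior update
        b[i] += 1 - sold
        revenue += prices[i] * sold
    return revenue
```

For instance, `thompson_pricing([1.0, 2.0, 3.0], np.array([0.9, 0.5, 0.2]), 1000)` runs the loop against a hypothetical demand curve.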
List of keywords
Machine Learning -> ML: Online learning
Machine Learning -> ML: Bayesian learning
Machine Learning -> ML: Multi-armed bandits
586
ParsNets: A Parsimonious Composition of Orthogonal and Low-Rank Linear Networks for Zero-Shot Learning
Jingcai Guo, Qihua Zhou, Xiaocheng Lu, Ruibin Li, Ziming Liu, Jie Zhang, Bo Han, Junyang Chen, Xin Xie, Song Guo
12 min. talk | August 7th at 15:00 | Session: ML: Machine Learning (4/6)
This paper provides a novel parsimonious yet efficient design for zero-shot learning (ZSL), dubbed ParsNets, in which we are interested in learning a composition of on-device friendly linear networks, each with orthogonality and low-rankness properties, to achieve equivalent or better performance than deep models. Concretely, we first refactor the core module of ZSL, i.e., the visual-semantics mapping function, into several base linear networks that correspond to diverse components of the semantic space, wherein the complex nonlinearity can be collapsed into simple local linearities. Then, to facilitate the generalization of local linearities, we construct a maximal-margin geometry on the learned features by enforcing low-rank constraints on intra-class samples and high-rank constraints on inter-class samples, resulting in orthogonal subspaces for different classes. To enhance the model's adaptability and counterbalance over- and under-fitting, a set of sample-wise indicators is employed to select a sparse subset of these base linear networks to form a composite semantic predictor for each sample. Notably, the maximal-margin geometry guarantees the diversity of features, while the local linearities guarantee efficiency. Thus, our ParsNets can generalize better to unseen classes and can be deployed flexibly on resource-constrained devices.
List of keywords
Machine Learning -> ML: Cost-sensitive learning
Machine Learning -> ML: Ensemble methods
Machine Learning -> ML: Few-shot learning
Machine Learning -> ML: Learning sparse models
603
A Swap Relaxation-Based Local Search for the Latin Square Completion Problem
Zhenxuan Xie, Zhipeng Lü, Zhouxing Su, Chu-Min Li, Junwen Ding, Yuxuan Wang
6 min. talk | August 8th at 10:00 | Session: S: Search
The Latin square completion (LSC) problem aims to assign n symbols to the empty cells of a partially filled Latin square such that each symbol appears exactly once in each row and each column. In this paper, we propose a swap relaxation-based fast local search algorithm called SRLS for solving the LSC problem. First, it introduces a novel search space definition that forbids row conflicts, based on which a swap-based neighborhood is defined. Second, a color domain relaxation technique is employed in the swap-based neighborhood, temporarily accepting the violation of some constraints to connect high-quality solutions. Third, two effective scoring functions are adopted to select neighborhood moves, minimizing the number of conflicting edges as well as the number of color domain violations. Finally, SRLS employs an adaptive restart mechanism to balance the exploitation and exploration of the search. Extensive experiments on 1819 public benchmark instances demonstrate that SRLS outperforms the state-of-the-art algorithms in the literature in terms of both success rate and computational efficiency.
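The swap-based neighborhood can be sketched as follows (assumptions throughout): rows are kept conflict-free by construction, so a move swaps two symbols within a row and is scored by the change in column conflicts, one of the scoring ingredients mentioned above. The color-domain relaxation and restart machinery are omitted.

```python
def column_conflicts(square, col):
    # count repeated symbols in one column of a filled square (list of lists)
    seen, conflicts = set(), 0
    for row in square:
        conflicts += row[col] in seen
        seen.add(row[col])
    return conflicts

def swap_delta(square, r, c1, c2):
    # score the move "swap the symbols at (r, c1) and (r, c2)"
    before = column_conflicts(square, c1) + column_conflicts(square, c2)
    square[r][c1], square[r][c2] = square[r][c2], square[r][c1]
    after = column_conflicts(square, c1) + column_conflicts(square, c2)
    square[r][c1], square[r][c2] = square[r][c2], square[r][c1]  # undo
    return after - before    # negative = fewer conflicts, accept greedily
```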
List of keywords
Search -> S: Local search
Search -> S: Heuristic search
Search -> S: Meta-reasoning and meta-heuristics
Search -> S: Combinatorial search and optimisation
605
LLM-based Multi-Level Knowledge Generation for Few-shot Knowledge Graph Completion
Qian Li, Zhuo Chen, Cheng Ji, Shiqi Jiang, Jianxin Li
6 min. talk | August 7th at 15:00 | Session: DM: Data Mining (1/2)
Knowledge Graphs (KGs) are pivotal in various NLP applications but often grapple with incompleteness, especially due to the long-tail problem where infrequent, unpopular relationships drastically reduce the KG completion performance. In this paper, we focus on Few-shot Knowledge Graph Completion (FKGC), a task addressing these gaps in long-tail scenarios. Amidst the rapid evolution of Large Language Models, we propose a generation-based FKGC paradigm facilitated by LLM distillation. Our MuKDC framework employs multi-level knowledge distillation for few-shot KG completion, generating supplementary knowledge to mitigate data scarcity in few-shot environments. MuKDC comprises two primary components: Multi-level Knowledge Generation, which enriches the KG at various levels, and Consistency Assessment, to ensure the coherence and reliability of the generated knowledge. Most notably, our method achieves SOTA results in both FKGC and multi-modal FKGC benchmarks, significantly advancing KG completion and enhancing the understanding and application of LLMs in structured knowledge generation and assessment.
List of keywords
Data Mining -> DM: Knowledge graphs and knowledge base completion
Natural Language Processing -> NLP: Applications
616
A Multi-Valued Decision Diagram-Based Approach to Constrained Optimal Path Problems over Directed Acyclic Graphs
Mingwei Zhang, Liangda Fang, Zhenhao Gu, Quanlong Guan, Yong Lai
6 min. talk | August 8th at 10:00 | Session: CSO: Constraint Satisfaction and Optimization
Numerous combinatorial optimization problems can be reduced to the optimal path problem over directed acyclic graphs (DAGs). The constrained version of the optimal path problem requires the solution to satisfy a given logical constraint. BDD-constrained search (BCS) is an efficient algorithm for the constrained optimal path problem over DAGs. This algorithm treats edges as variables and constraints as Boolean functions, and it maintains the constraints via binary decision diagrams (BDDs), a compact representation of Boolean functions. However, BCS involves redundant operations during the search process. To reduce these redundant operations, we use vertices instead of edges as variables and hence represent constraints as multi-valued functions. Owing to this multi-valued representation, we propose a novel algorithm, MDD-constrained search (MCS), which uses multi-valued decision diagrams (MDDs), an efficient representation of multi-valued functions, instead of BDDs. In addition, we improve MCS via domain reduction in multi-valued functions. Experimental results show that our proposed algorithm outperforms BCS.
List of keywords
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Constraint Satisfaction and Optimization -> CSO: Applications
Constraint Satisfaction and Optimization -> CSO: Solvers and tools
Knowledge Representation and Reasoning -> KRR: Knowledge compilation
620
GenSeg: On Generating Unified Adversary for Segmentation
Yuxuan Zhang, Zhenbo Shi, Wei Yang, Shuchang Wang, Shaowei Wang, Yinxing Xue
6 min. talk | August 8th at 10:00 | Session: CV: Segmentation
Great advancements in semantic, instance, and panoptic segmentation have been made in recent years, yet the top-performing models remain vulnerable to imperceptible adversarial perturbations. Current attacks on segmentation primarily focus on a single task, and these methods typically rely on iterative instance-specific strategies, resulting in limited attack transferability and low efficiency. In this paper, we propose GenSeg, a Generative paradigm that creates unified adversaries for Segmentation tasks. In particular, we propose an intermediate-level objective to enhance attack transferability, including a mutual agreement loss for feature deviation and a prototype obfuscating loss to disrupt intra-class and inter-class relationships. Moreover, GenSeg crafts an adversary in a single forward pass, significantly boosting attack efficiency. Besides, we unify multiple segmentation tasks within GenSeg through a novel category-and-mask view, which makes it possible to attack these segmentation tasks in one framework and to conduct cross-domain and cross-task attacks as well. Extensive experiments demonstrate the superiority of GenSeg in black-box attacks compared with state-of-the-art attacks. To the best of our knowledge, GenSeg is the first approach capable of conducting cross-domain and cross-task attacks on segmentation tasks, which are closer to real-world scenarios.
List of keywords
Computer Vision -> CV: Segmentation
Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
634
Enhancing Scalability of Metric Differential Privacy via Secret Dataset Partitioning and Benders Decomposition
Chenxi Qiu
6 min. talk | August 7th at 10:00 | Session: CSO: Constraint optimization problems
Metric Differential Privacy (mDP) extends the concept of Differential Privacy (DP) to serve as a new paradigm of data perturbation. It is designed to protect secret data represented in general metric space, such as text data encoded as word embeddings or geo-location data on the road network or grid maps. To derive an optimal data perturbation mechanism under mDP, a widely used method is linear programming (LP), which, however, might suffer from a polynomial explosion of decision variables, rendering it impractical in large-scale mDP. In this paper, our objective is to develop a new computation framework to enhance the scalability of the LP-based mDP. Considering the connections established by the mDP constraints among the secret records, we partition the original secret dataset into various subsets. Building upon the partition, we reformulate the LP problem for mDP and solve it via Benders Decomposition, which is composed of two stages: (1) a master program to manage the perturbation calculation across subsets, and (2) a set of subproblems, each managing the perturbation derivation within a subset. Our experimental results on multiple datasets, including geo-location data in the road network/grid maps, text data, and synthetic data, underscore our proposed mechanism’s superior scalability and efficiency.
List of keywords
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Constraint Satisfaction and Optimization -> CSO: Constraint programming
Multidisciplinary Topics and Applications -> MTA: Security and privacy
638
Learning a Spiking Neural Network for Efficient Image Deraining
Tianyu Song, Guiyue Jin, Pengpeng Li, Kui Jiang, Xiang Chen, Jiyu Jin
6 min. talk | August 8th at 15:00 | Session: CV: Image and video synthesis and generation (2/2)
Recently, spiking neural networks (SNNs) have demonstrated substantial potential in computer vision tasks. In this paper, we present an Efficient Spiking Deraining Network, called ESDNet. Our work is motivated by the observation that rain pixel values lead to a more pronounced intensity of spike signals in SNNs. However, directly applying deep SNNs to the image deraining task remains a significant challenge, which is attributed to the information loss and training difficulties that arise from discrete binary activation and complex spatiotemporal dynamics. To this end, we develop a spiking residual block that converts the input into spike signals and then adaptively optimizes the membrane potential by introducing attention weights to adjust spike responses in a data-driven manner, alleviating the information loss caused by discrete binary activation. In this way, our ESDNet can effectively detect and analyze the characteristics of rain streaks by learning their fluctuations, which also enables better guidance for the deraining process and facilitates high-quality image reconstruction. Instead of relying on the ANN-SNN conversion strategy, we introduce a gradient proxy strategy to train the model directly, overcoming the challenge of training. Experimental results show that our approach achieves comparable performance to ANN-based methods while reducing energy consumption by 54%. The source code is available at https://github.com/MingTian99/ESDNet.
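The gradient-proxy (surrogate gradient) trick for direct SNN training can be sketched as a custom autograd function: the forward pass emits binary spikes, while the backward pass substitutes a smooth derivative for the non-differentiable step. The rectangular surrogate below is one common choice, not necessarily the paper's.

```python
import torch

class SpikeProxy(torch.autograd.Function):
    @staticmethod
    def forward(ctx, membrane, threshold=1.0):
        ctx.save_for_backward(membrane)
        ctx.threshold = threshold
        return (membrane >= threshold).float()       # binary spike

    @staticmethod
    def backward(ctx, grad_out):
        (membrane,) = ctx.saved_tensors
        # rectangular surrogate: pass gradients only near the threshold
        window = (membrane - ctx.threshold).abs() < 0.5
        return grad_out * window.float(), None       # no grad for threshold

spike = SpikeProxy.apply   # usage: s = spike(membrane_potential)
```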
List of keywords
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Applications
Computer Vision -> CV: Computational photography
652
Revitalizing Real Image Deraining via a Generic Paradigm towards Multiple Rainy Patterns
Xin Li, Yuxin Feng, Fan Zhou, Yun Liang, Zhuo Su
6 min. talk | August 9th at 11:30 | Session: CV: Computer Vision (1/2)
Synthetic data-driven methods perform well on the image rain removal task, but they still face many challenges in real rainfall scenarios due to the complexity and diversity of rainy patterns. In this paper, we propose a new generic paradigm for real image deraining from the perspective of synthesizing data that covers more rainy patterns and constructing image rain removal networks with strong generalization performance. First, instead of simply superimposing rain layers, we integrate various rainy patterns and design a synthesis pipeline that incorporates multiple degradation types. Second, we construct a Patterns-aware Rain Removal Network (PRRN), which learns from both synthetic and real data simultaneously. In addition, to eliminate the inevitable distribution differences between synthetic and real data, we design a new Multi-representation Inter-domain Alignment Module (MIAM) in PRRN. By using multiple parallel submodules, MIAM achieves alignment of data domains in multiple feature subspaces. Based on several authoritative objective evaluation metrics, we validate the effectiveness and robustness of the proposed method in real scenarios through extensive experiments carried out on five challenging real datasets.
List of keywords
Computer Vision -> CV: Computational photography
Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
669
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen
6 min. talk | August 7th at 11:30 | Session: ML: Representation learning
Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress. However, the extensive computational demands during pre-training pose a significant barrier to the potential application and optimization of audio SSL models. In this paper, inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality, we introduce the Efficient Audio Transformer (EAT) to further improve the effectiveness and efficiency of audio SSL. The proposed EAT adapts the bootstrap self-supervised training paradigm to the audio domain. A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events. Furthermore, we reveal that the masking strategy is critical in audio SSL pre-training and that superior audio representations can be obtained with large inverse block masks. Experimental results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, along with a significant pre-training speedup of up to ~15x compared to existing audio SSL models.
List of keywords
Machine Learning -> ML: Representation learning
Machine Learning -> ML: Self-supervised Learning
Natural Language Processing -> NLP: Speech
670
Graph Collaborative Expert Finding with Contrastive Learning
Qiyao Peng, Wenjun Wang, Hongtao Liu, Cuiying Huo, Minglai Shao
6 min. talk | August 6th at 15:00 | Session: DM: Mining graphs (2/3)
[+] More
[-] Less
In Community Question Answering (CQA) websites, most current expert finding methods model expert embeddings from textual features and optimize them with first-order expert-question interactions, i.e., the fact that an expert has answered a question. In this paper, we address the limitation of current models that typically neglect the intrinsic high-order connectivity within expert-question interactions, which is pivotal for collaborative effects. We introduce a simple yet effective approach: we conceptualize expert-question interactions as a bipartite graph and propose a novel graph-based expert finding method, named CGEF, which uses contrastive learning to effectively capture both first-order and intricate high-order connectivity. Specifically, we employ a question encoder to model questions from their titles and use a graph attention network to recursively propagate embeddings. Besides, to alleviate the problem of sparse interactions, we devise two auxiliary tasks to enhance expert modeling. First, we generate multiple views of each expert: 1) behavior-level augmentation randomly drops interaction edges in the graph; 2) interest-level augmentation randomly replaces question titles with tags in the graph. We then maximize the agreement between an expert and the corresponding augmented expert on a specific view. In this way, the model effectively injects collaborative signals into expert modeling. Extensive experiments on six CQA datasets demonstrate significant improvements over recent methods.
List of keywords
Data Mining -> DM: Mining graphs
Data Mining -> DM: Mining text, web, social media
Data Mining -> DM: Networks
Data Mining -> DM: Recommender systems
671
Boosting Efficiency in Task-Agnostic Exploration through Causal Knowledge
Yupei Yang, Biwei Huang, Shikui Tu, Lei Xu
6 min. talk | August 8th at 15:00 | Session: ML: Reinforcement learning (2/2)
[+] More
[-] Less
The effectiveness of model training heavily relies on the quality of available training resources. However, budget constraints often impose limitations on data collection efforts. To tackle this challenge, we introduce causal exploration in this paper, a strategy that leverages the underlying causal knowledge for both data collection and model training. In particular, we focus on enhancing the sample efficiency and reliability of world model learning within the domain of task-agnostic reinforcement learning. During the exploration phase, the agent actively selects actions expected to yield the causal insights most beneficial for world model training. Concurrently, the causal knowledge is acquired and incrementally refined as data collection proceeds. We demonstrate that causal exploration aids in learning accurate world models using less data and provide theoretical guarantees for its convergence. Empirical experiments, on both synthetic data and real-world applications, further validate the benefits of causal exploration. The source code is available at https://github.com/CMACH508/CausalExploration.
List of keywords
Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Active learning
Machine Learning -> ML: Causality
Uncertainty in AI -> UAI: Causality, structural causal models and causal inference
685
Agentive Permissions in Multiagent Systems
Qi Shi
6 min. talk | August 8th at 15:00 | Session: KRR: Knowledge Representation and Reasoning (2/2)
[+] More
[-] Less
This paper proposes to distinguish four forms of agentive permissions in multiagent settings. The main technical results are the complexity analysis of model checking, the semantic undefinability of modalities that capture these forms of permissions through each other, and a complete logical system capturing the interplay between these modalities.
List of keywords
Knowledge Representation and Reasoning -> KRR: Reasoning about actions
AI Ethics, Trust, Fairness -> ETF: Moral decision making
AI Ethics, Trust, Fairness -> ETF: AI and law, governance, regulation
AI Ethics, Trust, Fairness -> ETF: Ethical, legal and societal issues
694
Exploring the Inefficiency of Heavy Ball as Momentum Parameter Approaches 1
Xiaoge Deng, Tao Sun, Dongsheng Li, Xicheng Lu
6 min. talk | August 9th at 10:00 | Session: ML: Optimization
[+] More
[-] Less
The heavy ball momentum method is a commonly used technique for accelerating training processes in the machine learning community. However, empirical evidence suggests that the convergence of stochastic gradient descent (SGD) with heavy ball may slow down when the momentum hyperparameter approaches 1. Despite this observation, there are no established theories or solutions to explain and address this issue. In this study, we provide the first theoretical result that elucidates why momentum slows down SGD as the momentum parameter tends to 1. To better understand this inefficiency, we focus on the quadratic convex objective in our analysis. Our findings show that momentum accelerates SGD when the momentum parameter is not very close to 1. Conversely, when the momentum parameter approaches 1, momentum impairs SGD and degrades its stability. Based on these theoretical findings, we propose a descending warmup technique for heavy ball momentum, which exploits the advantages of the heavy ball method while overcoming the inefficiency that arises when the momentum tends to 1. Numerical results demonstrate the effectiveness of the proposed SHB-DW algorithm.
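To make the descending warmup idea concrete, here is a minimal sketch on a noisy quadratic; the linear decay schedule and all constants are assumptions for illustration, since the abstract does not specify SHB-DW's exact schedule.

    import numpy as np

    def shb_dw_sketch(A, b, x0, lr=0.05, beta_start=0.99, beta_end=0.9,
                      warmup=200, iters=1000, noise=0.01, seed=0):
        """Heavy-ball SGD on f(x) = 0.5 x^T A x - b^T x with a descending
        momentum warmup: beta decays linearly from beta_start to beta_end
        over the first `warmup` iterations, then stays at beta_end."""
        rng = np.random.default_rng(seed)
        x, x_prev = x0.copy(), x0.copy()
        for t in range(iters):
            beta = beta_end if t >= warmup else (
                beta_start + (beta_end - beta_start) * t / warmup)
            grad = A @ x - b + noise * rng.standard_normal(x.shape)  # stochastic gradient
            x, x_prev = x - lr * grad + beta * (x - x_prev), x
        return x

    A = np.diag([1.0, 10.0]); b = np.array([1.0, 1.0])
    print(shb_dw_sketch(A, b, x0=np.zeros(2)))  # approaches A^{-1} b = [1, 0.1]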
List of keywords
Machine Learning -> ML: Optimization
Machine Learning -> ML: Applications
Machine Learning -> ML: Learning theory
705
Counterfactual User Sequence Synthesis Augmented with Continuous Time Dynamic Preference Modeling for Sequential POI Recommendation
Lianyong Qi, Yuwen Liu, Weiming Liu, Shichao Pei, Xiaolong Xu, Xuyun Zhang, Yingjie Wang, Wanchun Dou
6 min. talk | August 9th at 10:00 | Session: DM: Recommender systems
[+] More
[-] Less
With the proliferation of Location-based Social Networks (LBSNs), user check-in data at Points-of-Interest (POIs) has surged, offering rich insights into user preferences. However, sequential POI recommendation systems face two pivotal challenges. The first lies in the difficulty of modeling time in a discrete space, which fails to accurately capture the dynamic nature of user preferences. The second is the inherent sparsity and noise in continuous POI recommendation, which hinder the recommendation process. To address these challenges, we propose counterfactual user sequence synthesis with continuous time dynamic preference modeling (CussCtpm). CussCtpm innovatively combines Gated Recurrent Units (GRUs) with neural Ordinary Differential Equations (ODEs) to model user preferences in a continuous-time framework. CussCtpm captures user preferences at both the POI level and the interest level, identifying deterministic and non-deterministic preference concepts. Particularly at the interest level, we employ GRUs and neural ODEs to model users' dynamic preferences in continuous space, aiming to capture finer-grained shifts in user preferences over time. Furthermore, CussCtpm utilizes counterfactual data augmentation to generate counterfactual positive and negative user sequences. Our extensive experiments on two widely-used public datasets demonstrate that CussCtpm outperforms several advanced baseline models.
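As a rough sketch of the continuous-time preference modeling described above: a GRU cell updates the hidden state at each check-in, and a learned ODE evolves it across the time gap between check-ins. The architecture below is an assumption for illustration (the abstract does not give CussCtpm's equations), and a fixed-step Euler loop stands in for a full ODE solver.

    import torch
    import torch.nn as nn

    class ODEGRUSketch(nn.Module):
        """Hidden state h evolves by dh/dt = f(h) between check-ins (Euler
        steps) and is updated by a GRU cell at each observed check-in."""
        def __init__(self, input_dim, hidden_dim):
            super().__init__()
            self.ode_func = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh())
            self.cell = nn.GRUCell(input_dim, hidden_dim)
            self.hidden_dim = hidden_dim

        def forward(self, events, deltas, n_euler=4):
            # events: (T, B, input_dim) check-in embeddings; deltas: (T, B) time gaps
            h = events.new_zeros(events.size(1), self.hidden_dim)
            for x_t, dt in zip(events, deltas):
                step = (dt / n_euler).unsqueeze(-1)
                for _ in range(n_euler):      # evolve h through the time gap
                    h = h + step * self.ode_func(h)
                h = self.cell(x_t, h)         # jump update at the check-in
            return h

    model = ODEGRUSketch(input_dim=32, hidden_dim=64)
    print(model(torch.randn(10, 4, 32), torch.rand(10, 4)).shape)  # torch.Size([4, 64])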
List of keywords
Data Mining -> DM: Recommender systems
709
Conflict-Alleviated Gradient Descent for Adaptive Object Detection
Wenxu Shi, Bochuan Zheng
6 min. talk | August 7th at 15:00 | Session: CV: Recognition (object detection, categorization)
[+] More
[-] Less
Unsupervised domain adaptive object detection (DAOD) aims to adapt detectors from a labeled source domain to an unlabelled target domain. Existing DAOD works learn feature representations that are both class-discriminative and domain-invariant by jointly minimizing the loss across the domain alignment and detection tasks. However, jointly solving different tasks may lead to conflicts, with one contributing factor being gradient conflicts during optimization. If left unaddressed, such disagreement may degrade adaptation performance. In this work, we propose an efficient optimization strategy named Conflict-Alleviated Gradient descent (CAGrad), which aims to alleviate the conflict between the two tasks (i.e., alignment and classification). Specifically, we alter the gradients by projecting each onto the normal plane of the other. The projection operation changes conflicting gradients from obtuse angles to acute angles, thus alleviating the conflict and achieving gradient harmonization. We further validate our theoretical analysis and methods on several domain adaptive object detection tasks, including cross-camera, weather, scene, and synthetic-to-real-world adaptation. Extensive experiments on multiple DAOD benchmarks demonstrate the effectiveness and superiority of our CAGrad.
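The projection step admits a compact sketch; the function below implements the generic conflicting-gradient projection the abstract describes (each gradient projected onto the normal plane of the other when their inner product is negative), with everything beyond that description assumed.

    import numpy as np

    def harmonize(g1, g2):
        """Project each gradient onto the normal plane of the other when they
        conflict (negative inner product), removing the opposing component."""
        g1p, g2p = g1.copy(), g2.copy()
        dot = g1 @ g2
        if dot < 0:
            g1p = g1 - dot / (g2 @ g2) * g2  # drop the component of g1 along g2
            g2p = g2 - dot / (g1 @ g1) * g1  # drop the component of g2 along g1
        return g1p, g2p

    g_align, g_detect = np.array([1.0, 0.5]), np.array([-0.8, 1.0])
    ga, gd = harmonize(g_align, g_detect)
    print(ga @ g_detect, gd @ g_align)  # both ~0: the conflict is removed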
List of keywords
Computer Vision -> CV: Recognition (object detection, categorization)
Machine Learning -> ML: Optimization
Machine Learning -> ML: Unsupervised learning
715
Invertible Residual Rescaling Models
Jinmin Li, Tao Dai, Yaohua Zha, Yilu Luo, Longfei Lu, Bin Chen, Zhi Wang, Shu-Tao Xia, Jingyun Zhang
6 min. talk | August 8th at 15:00 | Session: CV: Image and video synthesis and generation (2/2)
[+] More
[-] Less
Invertible Rescaling Networks (IRNs) and their variants have witnessed remarkable achievements in various image processing tasks like image rescaling. However, we observe that deeper IRNs are difficult to train, which limits their representational ability. To address this issue, we propose Invertible Residual Rescaling Models (IRRM) for image rescaling, which learn a bijection between a high-resolution image and its low-resolution counterpart with a specific distribution. Specifically, IRRM builds a deep network containing several Residual Downscaling Modules (RDMs) with long skip connections; each RDM consists of several Invertible Residual Blocks (IRBs) with short connections. In this way, the RDM allows rich low-frequency information to be bypassed by skip connections and forces the model to focus on extracting high-frequency information from the image. Extensive experiments show that our IRRM performs significantly better than other state-of-the-art methods with much fewer parameters and lower complexity. In particular, IRRM achieves PSNR gains of at least 0.3 dB over HCFlow and IRN in x4 rescaling, respectively, while using only 60% of the parameters and 50% of the FLOPs. The code will be available at https://github.com/THU-Kingmin/IRRM.
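One standard way to realize an invertible residual-style block is additive coupling: split the channels and let one half update the other, which is exactly invertible. The sketch below is an assumed illustration of this principle, not the IRB design from the paper, which the abstract does not spell out.

    import torch
    import torch.nn as nn

    class AdditiveCouplingBlock(nn.Module):
        """Invertible block via additive coupling: y1 = x1, y2 = x2 + F(x1);
        the inverse recovers x exactly as x2 = y2 - F(y1)."""
        def __init__(self, channels):
            super().__init__()
            half = channels // 2
            self.F = nn.Sequential(
                nn.Conv2d(half, half, 3, padding=1), nn.ReLU(),
                nn.Conv2d(half, half, 3, padding=1))

        def forward(self, x):
            x1, x2 = x.chunk(2, dim=1)
            return torch.cat([x1, x2 + self.F(x1)], dim=1)

        def inverse(self, y):
            y1, y2 = y.chunk(2, dim=1)
            return torch.cat([y1, y2 - self.F(y1)], dim=1)

    block = AdditiveCouplingBlock(8)
    x = torch.randn(1, 8, 16, 16)
    print(torch.allclose(block.inverse(block(x)), x, atol=1e-6))  # True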
List of keywords
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Applications
719
Probabilistic Contrastive Learning for Domain Adaptation
Junjie Li, Yixin Zhang, Zilei Wang, Saihui Hou, Keyu Tu, Man Zhang
6 min. talk | August 6th at 15:00 | Session: CV: Transfer, low-shot, semi- and un- supervised learning   
[+] More
[-] Less
Contrastive learning has shown impressive success in enhancing feature discriminability for various visual tasks in a self-supervised manner, but the standard contrastive paradigm (features + l2 normalization) has limited benefits when applied to domain adaptation. We find that this is mainly because the class weights (the weights of the final fully connected layer) are ignored in the domain adaptation optimization process, which makes it difficult for features to cluster around the corresponding class weights. To solve this problem, we propose the simple but powerful Probabilistic Contrastive Learning (PCL), which moves beyond the standard paradigm by removing l2 normalization and replacing the features with probabilities. PCL guides the probability distribution towards a one-hot configuration, thus minimizing the discrepancy between features and class weights. We conduct extensive experiments to validate the effectiveness of PCL and observe consistent performance gains on five tasks, i.e., Unsupervised/Semi-Supervised Domain Adaptation (UDA/SSDA), Semi-Supervised Learning (SSL), UDA Detection, and Semantic Segmentation. Notably, for UDA Semantic Segmentation on SYNTHIA, PCL surpasses the sophisticated CPSL-D by 2% in terms of mean IoU with a much lower training cost (PCL: 1x3090, 5 days vs. CPSL-D: 4xV100, 11 days). Code is available at https://github.com/ljjcoder/Probabilistic-Contrastive-Learning.
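A minimal sketch of the probability-based contrast: an InfoNCE-style loss where the usual l2-normalized features are replaced by softmax probabilities. The exact PCL objective is not given in the abstract, so the loss below is only an assumed illustration of the stated idea.

    import torch
    import torch.nn.functional as F

    def probabilistic_contrastive_loss(logits_q, logits_k, temperature=0.1):
        """InfoNCE-style loss on class probabilities instead of l2-normalized
        features; matching rows of the two views are the positive pairs."""
        p_q = F.softmax(logits_q, dim=1)   # (N, C), no l2 normalization
        p_k = F.softmax(logits_k, dim=1)
        sim = p_q @ p_k.t() / temperature  # (N, N) probability similarities
        targets = torch.arange(p_q.size(0))
        return F.cross_entropy(sim, targets)

    print(probabilistic_contrastive_loss(torch.randn(8, 10), torch.randn(8, 10)))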
List of keywords
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
Computer Vision -> CV: Recognition (object detection, categorization)
756
WSRFNet: Wavelet-Based Scale-Specific Recurrent Feedback Network for Diabetic Retinopathy Lesion Segmentation
Xuan Li, Xiangqian Wu
6 min. talk | August 9th at 10:00 | Session: CV: Biomedical image analysis
[+] More
[-] Less
Diabetic retinopathy lesion segmentation (DRLS) faces the challenge of significant variation in the size of different lesions. An effective way to address this challenge is to fuse multi-scale features. To boost performance, most existing DRLS methods devise sophisticated multi-scale feature fusion modules. In contrast, we focus on improving the quality of the multi-scale features themselves to enhance the fused multi-scale representation. To this end, we design a Wavelet-based Scale-specific Recurrent Feedback Network (WSRFNet), which refines multi-scale features using a recurrent feedback mechanism. Specifically, to avoid information loss when introducing feedback to multi-scale features, we propose a wavelet-based feedback pyramid module (WFPM), which is based on a reversible downsampling operation, i.e., the Haar wavelet transform. Unlike the scale-agnostic feedback used in previous feedback methods, we develop a scale-specific refinement module (SRM), which utilizes scale-specific feedback to refine features of different scales in a targeted manner. Experimental results on the IDRiD and DDR datasets show that our approach outperforms state-of-the-art models. The code is available at https://github.com/xuanli01/WSRFNet.
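The reversible downsampling mentioned above is the standard one-level 2D Haar transform; a minimal sketch, independent of WSRFNet's exact wiring, shows why no information is lost:

    import numpy as np

    def haar_downsample(x):
        """One-level 2D Haar transform: (H, W) -> four (H/2, W/2) subbands."""
        a, b = x[0::2, 0::2], x[0::2, 1::2]
        c, d = x[1::2, 0::2], x[1::2, 1::2]
        return ((a + b + c + d) / 2, (a - b + c - d) / 2,
                (a + b - c - d) / 2, (a - b - c + d) / 2)

    def haar_upsample(ll, lh, hl, hh):
        """Exact inverse of haar_downsample."""
        h, w = ll.shape
        x = np.empty((2 * h, 2 * w))
        x[0::2, 0::2] = (ll + lh + hl + hh) / 2
        x[0::2, 1::2] = (ll - lh + hl - hh) / 2
        x[1::2, 0::2] = (ll + lh - hl - hh) / 2
        x[1::2, 1::2] = (ll - lh - hl + hh) / 2
        return x

    x = np.random.rand(8, 8)
    print(np.allclose(haar_upsample(*haar_downsample(x)), x))  # True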
List of keywords
Computer Vision -> CV: Biomedical image analysis
Computer Vision -> CV: Segmentation
768
Generating More Audios for End-to-End Spoken Language Understanding
Xuxin Cheng, Yuexian Zou
6 min. talk | August 6th at 11:30 | Session: NLP: Dialogue and interactive systems
[+] More
[-] Less
End-to-end spoken language understanding (SLU) aims to directly capture the comprehensive semantics of a given spoken utterance without generating any transcript. Since transcripts are not always available, textless SLU is attracting increasing attention: it eliminates the need for transcripts but often does not perform as well as SLU models trained with them. In this paper, we focus on scenarios where transcripts are not available and propose a framework, GMA-SLU, that generates more audios according to the labels. To alleviate the modality gap between text and audio, two language models are developed with discrete tokens as a bridge: the first uses labels to generate semantic tokens, and the second takes these semantic tokens together with the acoustic tokens of the source audios to generate synthetic audios. All experiments are conducted on the monolingual SLU dataset SLURP and the multilingual SLU dataset MINDS-14. Experimental results show that our method outperforms the previous best textless end-to-end SLU models and obtains performance comparable to models trained with the assistance of the corresponding transcripts.
List of keywords
Natural Language Processing -> NLP: Dialogue and interactive systems
775
A New Guaranteed Outlier Removal Method Based on Plane Constraints for Large-Scale LiDAR Point Cloud Registration
Gang Ma, Hui Wei, Runfeng Lin, Jialiang Wu
6 min. talk | August 8th at 11:30 | Session: ROB: Robotics (2/2)
[+] More
[-] Less
In this paper, we present a novel registration method based on plane constraints for large-scale LiDAR point clouds, effectively decoupling rotation estimation and translation estimation. For rotation estimation, we propose an outlier removal method that combines coarse filtering with rotation-invariant constraints and refined filtering based on computational geometric consistency checks, effectively pruning outliers and robustly estimating accurate relative rotations from plane normals. In translation estimation, we propose a component-wise method based on plane translation constraints to efficiently estimate relative translations. The robustness and effectiveness of our proposed method are empirically validated on three popular LiDAR point cloud datasets. The experimental results convincingly demonstrate that our approach achieves state-of-the-art performance.
List of keywords
Robotics -> ROB: Robotics and vision
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Scene analysis and understanding   
Robotics -> ROB: Perception
778
OSIC: A New One-Stage Image Captioner Coined
Bo Wang, Zhao Zhang, Mingbo Zhao, Xiaojie Jin, Mingliang Xu, Meng Wang
6 min. talk | August 9th at 11:30 | Session: CV: Machine learning for vision
[+] More
[-] Less
Mainstream image captioning models are usually two-stage captioners, i.e., they encode region features with a pre-trained detector and then feed them into a language model to generate captions. However, such a two-stage procedure leads to a task-based information gap that decreases performance, because the region features in the detection task are suboptimal representations and cannot provide all the information necessary for subsequent caption generation. Besides, region features are usually taken from the last layer of the detector, which loses the local details of images. In this paper, we propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning, which directly transforms images into descriptive sentences in one stage to eliminate the information gap. Specifically, to obtain rich features, multi-level features are captured by a Swin Transformer and then fed into a novel dynamic multi-sight embedding module to exploit both the global structure and local texture of input images. To enhance the global modeling capacity of the visual encoder, we propose a new dual-dimensional refining module to non-locally model feature interaction. As a result, OSIC can directly obtain rich semantic information to improve captioning. Extensive comparisons on the MS-COCO, Flickr8K, and Flickr30K benchmarks verify the superior performance of our method.
List of keywords
Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Scene analysis and understanding   
Computer Vision -> CV: Vision, language and reasoning
Natural Language Processing -> NLP: Language generation
805
Strengthening Layer Interaction via Dynamic Layer Attention
Kaishen Wang, Xun Xia, Jian Liu, Zhang Yi, Tao He
6 min. talk | August 7th at 15:00 | Session: ML: Machine Learning (3/6)
[+] More
[-] Less
In recent years, employing layer attention to enhance interaction among hierarchical layers has proven to be a significant advancement in building network structures. In this paper, we delve into the distinction between layer attention and the general attention mechanism, noting that existing layer attention methods achieve layer interaction on fixed feature maps in a static manner. These static layer attention methods limit the ability to extract contextual features across layers. To restore the dynamic context representation capability of the attention mechanism, we propose a Dynamic Layer Attention (DLA) architecture. DLA comprises dual paths: the forward path utilizes an improved recurrent neural network block, named the Dynamic Sharing Unit (DSU), for context feature extraction, and the backward path updates features using these shared context representations. Finally, the attention mechanism is applied to the dynamically refreshed feature maps across layers. Experimental results demonstrate the effectiveness of the proposed DLA architecture, which outperforms other state-of-the-art methods on image recognition and object detection tasks. Additionally, the DSU block has been evaluated as an efficient plugin in the proposed DLA architecture. The code is available at https://github.com/tunantu/Dynamic-Layer-attention.
List of keywords
Machine Learning -> ML: Attention models
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Representation learning
Machine Learning -> ML: Theory of deep learning
822
BeyondVision: An EMG-driven Micro Hand Gesture Recognition Based on Dynamic Segmentation
Nana Wang, Jianwei Niu, Xuefeng Liu, Dongqin Yu, Guogang Zhu, Xinghao Wu, Mingliang Xu, Hao Su
6 min. talk | August 6th at 11:30 | Session: MTA: Multidisciplinary Topics and Applications (1/2)
[+] More
[-] Less
Hand gesture recognition (HGR) plays a pivotal role in natural and intuitive human-computer interaction. Recent HGR methods focus on recognizing gestures from vision-based images or videos. However, vision-based methods are limited in recognizing micro hand gestures (MHGs) (e.g., a pinch within 1 cm) and gestures with occluded fingers. To address these issues, we combine the electromyography (EMG) technique with deep learning and propose BeyondVision, an EMG-driven MHG recognition system. BeyondVision consists of a wristband-style EMG sampling device and a tailored lightweight neural network, BV-Net, that accurately translates EMG signals of MHGs into control commands in real time. Moreover, we propose a post-processing mechanism and a weight segmentation algorithm that effectively improve the accuracy of MHG recognition. Subjective and objective experimental results show that our approach achieves an average recognition rate of over 95%, a 2000 Hz sampling frequency, and real-time micro gesture recognition. Our technique has been applied in a commercially available product, introduced at: https://github.com/tyc333/NoBarriers.
List of keywords
Multidisciplinary Topics and Applications -> MTA: AI hardware
Computer Vision -> CV: Biometrics, face, gesture and pose recognition
Machine Learning -> ML: Applications
Multidisciplinary Topics and Applications -> MTA: Interactive entertainment
825
Dynamically Anchored Prompting for Task-Imbalanced Continual Learning
Chenxing Hong, Yan Jin, Zhiqi Kang, Yizhou Chen, Mengke Li, Yang Lu, Hanzi Wang
6 min. talk | August 7th at 10:00 | Session: ML: Incremental learning
[+] More
[-] Less
Existing continual learning literature relies heavily on the strong assumption that tasks arrive as a balanced data stream, which is often unrealistic in real-world applications. In this work, we explore task-imbalanced continual learning (TICL) scenarios, where the distribution of task data is non-uniform across the whole learning process. We find that imbalanced tasks significantly challenge the capability of models to control the trade-off between stability and plasticity, from the perspective of recent prompt-based continual learning methods. On top of this finding, we propose Dynamically Anchored Prompting (DAP), a prompt-based method that maintains only a single general prompt to adapt to the shifts within a task stream dynamically. This general prompt is regularized in the prompt space with two specifically designed prompt anchors, called the boosting anchor and the stabilizing anchor, to balance stability and plasticity in TICL. Remarkably, DAP achieves this balance by storing only a single prompt across the data stream, therefore offering a substantial advantage in rehearsal-free continual learning. Extensive experiments demonstrate that the proposed DAP yields 4.5% to 15% absolute improvements over state-of-the-art methods on benchmarks under task-imbalanced settings. Our code is available at https://github.com/chenxing6666/DAP.
List of keywords
Machine Learning -> ML: Incremental learning
Computer Vision -> CV: Recognition (object detection, categorization)
Data Mining -> DM: Class imbalance and unequal cost
Machine Learning -> ML: Classification
883
Efficient Screen Content Image Compression via Superpixel-based Content Aggregation and Dynamic Feature Fusion
Sheng Shen, Huanjing Yue, Jingyu Yang
6 min. talk | August 8th at 11:30 | Session: CV: Image and video synthesis and generation (1/2)
[+] More
[-] Less
This paper addresses the challenge of efficiently compressing screen content images (SCIs): computer-generated images with unique attributes such as large uniform regions, sharp edges, and limited color palettes, which pose difficulties for conventional compression algorithms. We propose a Superpixel-based Content Aggregation Block (SCAB) that aggregates local pixels into one superpixel and aggregates non-local information via a superpixel transformer. Such aggregation enables the dynamic assimilation of non-local information while maintaining manageable complexity. Furthermore, we enhance our channel-wise context entropy model with a Dynamic Feature Fusion (DFF) mechanism, which integrates decoded slices and side information dynamically based on their global correlation, allowing the network to learn the optimal weights for global information usage. Extensive experiments on three SCI datasets (SCID, CCT, and SIQAD) show our method's superior RD performance and inference time, making it the first network comparable with the advanced VVC-SCC standard.
List of keywords
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Computational photography
Computer Vision -> CV: Other
889
Bandits with Concave Aggregated Reward
Yingqi Yu, Sijia Zhang, Shaoang Li, Lan Zhang, Wei Xie, Xiang-Yang Li
6 min. talk | August 7th at 15:00 | Session: ML: Machine Learning (5/6)
[+] More
[-] Less
The multi-armed bandit is a simple but powerful algorithmic framework, and many effective algorithms have been proposed for various online models. In numerous applications, however, the decision-maker faces diminishing marginal utility, and under such non-linear aggregations those algorithms often have poor regret bounds. Motivated by this, we study a bandit problem with diminishing marginal utility, which we term bandits with concave aggregated reward (BCAR). To tackle this problem, we propose two algorithms, SW-BCAR and SWUCB-BCAR. Through theoretical analysis, we establish the effectiveness of these algorithms in addressing the BCAR issue. Extensive simulations demonstrate that our algorithms achieve better results than the most advanced bandit algorithms.
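To make the diminishing-marginal-utility setting concrete, here is a toy simulation in which the utility is a concave function (a square root, an assumed example; the abstract does not fix the aggregation) of each arm's accumulated reward, so spreading pulls across arms beats hammering the single best arm. Neither policy below is SW-BCAR or SWUCB-BCAR.

    import numpy as np

    rng = np.random.default_rng(0)
    means = np.array([0.8, 0.6, 0.4])   # Bernoulli arm means
    T = 3000

    def run(policy):
        sums = np.zeros(3)              # accumulated reward per arm
        pulls = np.zeros(3)
        for t in range(T):
            a = policy(sums, pulls, t)
            sums[a] += float(rng.random() < means[a])
            pulls[a] += 1
        return np.sqrt(sums).sum()      # concave aggregated utility

    best_single = run(lambda s, p, t: 0)        # always pull the best arm
    oracle_greedy = run(lambda s, p, t: int(np.argmax(
        np.sqrt(s + means) - np.sqrt(s))))      # oracle marginal-gain greedy
    print(best_single, oracle_greedy)           # spreading pulls wins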
List of keywords
Machine Learning -> ML: Multi-armed bandits
898
Shap-Mix: Shapley Value Guided Mixing for Long-Tailed Skeleton Based Action Recognition
Jiahang Zhang, Lilang Lin, Jiaying Liu
6 min. talk | August 8th at 15:00 | Session: CV: Computer Vision (2/2)
[+] More
[-] Less
In real-world scenarios, human actions often fall into a long-tailed distribution, which makes existing skeleton-based action recognition works, mostly designed for balanced datasets, suffer a sharp performance degradation. Recently, many efforts have been devoted to image/video long-tailed learning. However, directly applying these methods to skeleton data can be sub-optimal, since they do not consider the crucial spatial-temporal motion patterns, especially modality-specific methodologies such as data augmentation. To this end, considering the crucial role of body parts in spatially concentrated human actions, we attend to mixing augmentations and propose a novel method, Shap-Mix, which improves long-tailed learning by mining representative motion patterns for tail categories. Specifically, we first develop an effective spatial-temporal mixing strategy for skeletons to boost representation quality. Then, we present a saliency-guided mixing method, consisting of saliency estimation based on Shapley values and a tail-aware mixing policy. It preserves the salient motion parts of minority classes in the mixed data, explicitly establishing the relationships between crucial body structure cues and high-level semantics. Extensive experiments on three large-scale skeleton datasets show our remarkable performance improvement under both long-tailed and balanced settings. Our project is publicly available at: https://jhang2020.github.io/Projects/Shap-Mix/Shap-Mix.html.
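A minimal sketch of Shapley-value saliency by Monte Carlo permutation sampling, the generic estimator behind Shapley values; the body-part partition and value function here are assumptions, since the abstract does not give Shap-Mix's exact estimator.

    import numpy as np

    def shapley_saliency(value_fn, n_parts, n_samples=200, seed=0):
        """Estimate each part's Shapley value as its average marginal
        contribution to value_fn over random orderings of the parts."""
        rng = np.random.default_rng(seed)
        phi = np.zeros(n_parts)
        for _ in range(n_samples):
            order = rng.permutation(n_parts)
            mask = np.zeros(n_parts, dtype=bool)
            prev = value_fn(mask)
            for part in order:
                mask[part] = True
                cur = value_fn(mask)
                phi[part] += cur - prev
                prev = cur
        return phi / n_samples

    # Toy value function: parts 0 and 2 matter, part 1 is irrelevant.
    weights = np.array([0.7, 0.0, 0.3])
    print(shapley_saliency(lambda m: float(weights[m].sum()), 3))  # ~[0.7, 0, 0.3]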
List of keywords
Computer Vision -> CV: Action and behavior recognition
899
Class-consistent Contrastive Learning Driven Cross-dimensional Transformer for 3D Medical Image Classification
Qikui Zhu, Chuan Fu, Shuo Li
6 min. talk | August 9th at 10:00 | Session: CV: Biomedical image analysis
[+] More
[-] Less
The Transformer has emerged as an active research topic in medical image analysis. Yet, three substantial challenges limit the effectiveness of both 2D and 3D Transformers in 3D medical image classification: 1) the challenge of capturing spatial structure correlations, due to the unreasonable flattening operation; 2) the challenge of high computational complexity and memory consumption, which grow quadratically for 3D medical data; 3) the challenge of discriminative representation learning, due to data sensitivity. To address these challenges, a novel Cross-dimensional Transformer (CdTransformer) and a creative Class-consistent Contrastive Learning (CcCL) are proposed. Specifically, CdTransformer consists of two novel modules: 1) a Cross-dimensional Attention Module (CAM), which overcomes the limitation that Transformers cannot reasonably establish spatial structure correlations on 3D medical data, while reducing computational complexity and memory consumption; 2) an Inter-dimensional Feed-forward Network (IdFN), which addresses the inability of traditional feed-forward networks to learn the depth-dimension information unique to 3D medical data. CcCL innovatively takes full advantage of inter-class and intra-class features from slice-distorted samples to help the Transformer learn feature representations. CdTransformer and CcCL are validated on six 3D medical image classification tasks. Extensive experimental results demonstrate that CdTransformer outperforms state-of-the-art CNNs and Transformers on 3D medical image classification, and that CcCL significantly improves the Transformer's discriminative representation learning.
List of keywords
Computer Vision -> CV: Biomedical image analysis
Computer Vision -> CV: Applications
Machine Learning -> ML: Adversarial machine learning
Machine Learning -> ML: Classification
901
Hundred-Kilobyte Lookup Tables for Efficient Single-Image Super-Resolution
Binxiao Huang, Jason Chun Lok Li, Jie Ran, Boyu Li, Jiajun Zhou, Dahai Yu, Ngai Wong
6 min. talk | August 8th at 15:00 | Session: CV: Image and video synthesis and generation (2/2)
[+] More
[-] Less
Conventional super-resolution (SR) schemes make heavy use of convolutional neural networks (CNNs), which involve intensive multiply-accumulate (MAC) operations, and require specialized hardware such as graphics processing units. This contradicts the regime of edge AI that often runs on devices strained by power, computing, and storage resources. Such a challenge has motivated a series of lookup table (LUT)-based SR schemes that employ simple LUT readout and largely elude CNN computation. Nonetheless, the multi-megabyte LUTs in existing methods still prohibit on-chip storage and necessitate off-chip memory transport. This work tackles this storage hurdle and innovates hundred-kilobyte LUT (HKLUT) models amenable to on-chip cache. Utilizing an asymmetric two-branch multistage network coupled with a suite of specialized kernel patterns, HKLUT demonstrates an uncompromising performance and superior hardware efficiency over existing LUT schemes. Our implementation is publicly available at: https://github.com/jasonli0707/hklut.
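To illustrate how LUT-based SR elides CNN computation at inference, here is a toy single-table sketch: a small quantized pixel pattern indexes the table, which directly returns the upscaled pixels. HKLUT's two-branch multistage design and kernel patterns are not reproduced, and the table entries below are placeholders rather than values transferred from a trained network.

    import numpy as np

    def build_toy_lut(scale=2, bits=4):
        """Toy LUT mapping a quantized 2-pixel pattern to a scale x scale
        output patch; entries just replicate the first pixel (placeholder)."""
        levels = 1 << bits
        lut = np.zeros((levels, levels, scale * scale), dtype=np.uint8)
        for p0 in range(levels):
            lut[p0, :, :] = int(round(p0 * 255 / (levels - 1)))
        return lut  # 16*16*4 bytes = 1 KiB: easily on-chip

    def sr_lookup(img, lut, scale=2, bits=4):
        """Inference is pure table readout: no multiply-accumulate at all."""
        h, w = img.shape
        q = (img.astype(np.uint16) * ((1 << bits) - 1) // 255).astype(np.uint8)
        out = np.zeros((h * scale, w * scale), dtype=np.uint8)
        for i in range(h):
            for j in range(w):
                p0, p1 = q[i, j], q[i, (j + 1) % w]  # 2-pixel horizontal pattern
                out[i*scale:(i+1)*scale, j*scale:(j+1)*scale] = \
                    lut[p0, p1].reshape(scale, scale)
        return out

    img = (np.random.rand(4, 4) * 255).astype(np.uint8)
    print(sr_lookup(img, build_toy_lut()).shape)  # (8, 8)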
List of keywords
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Applications
907
MAS-SAM: Segment Any Marine Animal with Aggregated Features
Tianyu Yan, Zifu Wan, Xinhao Deng, Pingping Zhang, Yang Liu, Huchuan Lu
12 min. talk | August 6th at 11:30 | Session: ROB: Robotics (1/2)
[+] More
[-] Less
Recently, the Segment Anything Model (SAM) has shown exceptional performance in generating high-quality object masks and achieving zero-shot image segmentation. However, as a versatile vision model, SAM is primarily trained on large-scale natural-light images, and in underwater scenes it exhibits substantial performance degradation due to light scattering and absorption. Meanwhile, the simplicity of SAM's decoder might lead to the loss of fine-grained object details. To address these issues, we propose a novel feature learning framework named MAS-SAM for marine animal segmentation, which integrates effective adapters into SAM's encoder and constructs a pyramidal decoder. More specifically, we first build a new SAM encoder with effective adapters for underwater scenes. Then, we introduce a Hypermap Extraction Module (HEM) to generate multi-scale features for comprehensive guidance. Finally, we propose a Progressive Prediction Decoder (PPD) to aggregate the multi-scale features and predict the final segmentation results. When grafted with the Fusion Attention Module (FAM), our method can extract richer marine information, from global contextual cues to fine-grained local details. Extensive experiments on four public MAS datasets demonstrate that our MAS-SAM obtains better results than other typical segmentation methods. The source code is available at https://github.com/Drchip61/MAS-SAM.
List of keywords
Robotics -> ROB: Applications
Robotics -> ROB: Perception
Robotics -> ROB: Robotics and vision
932
Hierarchical Reinforcement Learning on Multi-Channel Hypergraph Neural Network for Course Recommendation
Lu Jiang, Yanan Xiao, Xinxin Zhao, Yuanbo Xu, Shuli Hu, Pengyang Wang, Minghao Yin
6 min. talk | August 7th at 11:30 | Session: DM: Applications
[+] More
[-] Less
With the widespread popularity of massive open online courses, personalized course recommendation has become increasingly important for enhancing users' learning efficiency. While achieving promising performance, current works suffer from the variability across users and other MOOC entities. To address this problem, we propose hierarchical reinforcement learning with a multi-channel hypergraph neural network for course recommendation (HHCoR). Specifically, we first construct an online course hypergraph as the environment to capture the complex relationships and historical information by considering all entities. Then, we design a multi-channel propagation mechanism to aggregate embeddings in the online course hypergraph and extract user interest through an attention layer. Besides, we employ two-level decision-making: the low-level agent focuses on rating courses, while the high-level agent integrates these considerations to finalize the decision. Furthermore, for co-optimization, we design a joint reward function to improve the policies of the two-level agents. Finally, we conducted extensive experiments on two real-world datasets, and the quantitative results demonstrate the effectiveness of the proposed method.
List of keywords
Data Mining -> DM: Applications
Data Mining -> DM: Mining graphs
Data Mining -> DM: Mining heterogenous data
Data Mining -> DM: Mining spatial and/or temporal data
936
Evolutionary Generalized Zero-Shot Learning
Dubing Chen, Chenyi Jiang, Haofeng Zhang
12 min. talk | August 6th at 15:00 | Session: CV: Transfer, low-shot, semi- and un- supervised learning   
[+] More
[-] Less
Attribute-based Zero-Shot Learning (ZSL) has revolutionized the ability of models to recognize new classes not seen during training. However, with the advancement of large-scale models, expectations have risen. Beyond merely achieving zero-shot generalization, there is a growing demand for universal models that can continually evolve in expert domains using unlabeled data. To address this, we introduce a scaled-down instantiation of this challenge: Evolutionary Generalized Zero-Shot Learning (EGZSL). This setting allows a low-performing zero-shot model to adapt to the test data stream and evolve online. We elaborate on three challenges of this special task, i.e., catastrophic forgetting, initial prediction bias, and evolutionary data class bias. Moreover, we propose targeted solutions for each challenge, resulting in a generic method capable of continuous evolution from a given initial GZSL model. Experiments on three popular GZSL benchmark datasets demonstrate that our model can learn from the test data stream while other baselines fail. The code is available at https://github.com/cdb342/EGZSL.
List of keywords
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
Computer Vision -> CV: Vision, language and reasoning
937
Learning Spatial Similarity Distribution for Few-shot Object Counting
Yuanwu Xu, Feifan Song, Haofeng Zhang
6 min. talk | August 6th at 15:00 | Session: CV: Transfer, low-shot, semi- and un- supervised learning   
[+] More
[-] Less
Few-shot object counting aims to count the number of objects in a query image that belong to the same class as given exemplar images. Existing methods compute the similarity between the query image and exemplars in the 2D spatial domain and perform regression to obtain the count. However, these methods overlook the rich information in the spatial distribution of similarity on the exemplar images, which significantly impacts matching accuracy. To address this issue, we propose a network that learns the Spatial Similarity Distribution (SSD) for few-shot object counting: it preserves the spatial structure of exemplar features and calculates a point-to-point 4D similarity pyramid between the query features and exemplar features, capturing the complete distribution information for each point in the 4D similarity space. We propose a Similarity Learning Module (SLM), which applies efficient center-pivot 4D convolutions on the similarity pyramid to map different similarity distributions to distinct predicted density values, thereby obtaining accurate counts. Furthermore, we introduce a Feature Cross Enhancement (FCE) module that enhances query and exemplar features mutually to improve the accuracy of feature matching. Our approach outperforms state-of-the-art methods on multiple datasets, including FSC-147 and CARPK. Code is available at https://github.com/CBalance/SSD.
List of keywords
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
Computer Vision -> CV: Recognition (object detection, categorization)
965
ENOTO: Improving Offline-to-Online Reinforcement Learning with Q-Ensembles
Kai Zhao, Jianye Hao, Yi Ma, Jinyi Liu, Yan Zheng, Zhaopeng Meng
6 min. talk | August 7th at 15:00 | Session: ML: Reinforcement learning (1/2)
[+] More
[-] Less
Offline reinforcement learning (RL) is a learning paradigm where an agent learns from a fixed dataset of experience. However, learning solely from a static dataset can limit performance due to the lack of exploration. To overcome this limitation, offline-to-online RL combines offline pre-training with online fine-tuning, which enables the agent to further refine its policy by interacting with the environment in real time. Despite its benefits, existing offline-to-online RL methods suffer from performance degradation and slow improvement during the online phase. To tackle these challenges, we propose a novel framework called ENsemble-based Offline-To-Online (ENOTO) RL. By increasing the number of Q-networks, we seamlessly bridge offline pre-training and online fine-tuning without degrading performance. Moreover, to expedite online performance enhancement, we appropriately loosen the pessimism of Q-value estimation and incorporate ensemble-based exploration mechanisms into our framework. Experimental results demonstrate that ENOTO can substantially improve the training stability, learning efficiency, and final performance of existing offline RL methods during online fine-tuning on a range of locomotion and navigation tasks, significantly outperforming existing offline-to-online RL methods.
List of keywords
Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Ensemble methods
Machine Learning -> ML: Offline reinforcement learning
Machine Learning -> ML: Online learning
973
Optimisation and Approximation in Abstract Argumentation: The Case of Stable Semantics
Matthias Thimm
6 min. talk | August 6th at 15:00 | Session: KRR: Argumentation
[+] More
[-] Less
We analyse two soft notions of stable extensions in abstract argumentation, one that weakens the requirement of having full range and one that weakens the requirement of conflict-freeness. We then consider optimisation problems over these two notions that represent optimisation variants of the credulous reasoning problem with stable semantics. We investigate the computational complexity of these two problems in terms of the complexity of solving the optimisation problem exactly and in terms of approximation complexity. We also present some polynomial-time approximation algorithms for these optimisation problems and investigate their approximation quality experimentally.
List of keywords
Knowledge Representation and Reasoning -> KRR: Argumentation
Knowledge Representation and Reasoning -> KRR: Computational complexity of reasoning
980
Delve into Base-Novel Confusion: Redundancy Exploration for Few-Shot Class-Incremental Learning
Haichen Zhou, Yixiong Zou, Ruixuan Li, Yuhua Li, Kui Xiao
6 min. talk | August 7th at 10:00 | Session: ML: Incremental learning
[+] More
[-] Less
Few-shot class-incremental learning (FSCIL) aims to acquire knowledge from novel classes with limited samples while retaining information about base classes. Existing methods address catastrophic forgetting and overfitting by freezing the feature extractor during novel-class learning. However, these methods tend to cause confusion between base and novel classes, i.e., novel-class samples are classified into base classes. In this paper, we delve into this phenomenon to study its cause and solution. We first interpret the confusion as a collision between the novel-class and base-class regions in the feature space. We then find that the collision is caused by label-irrelevant redundancies within the base-class feature and pixel space. Through qualitative and quantitative experiments, we identify this redundancy as a shortcut in base-class training, which can be decoupled to alleviate the collision. Based on this analysis, we propose a method for FSCIL named Redundancy Decoupling and Integration (RDI) to alleviate the collision between base and novel classes. RDI first decouples redundancies from the base-class space to shrink the intra-base-class feature space. It then integrates the redundancies as a dummy class to enlarge the inter-base-class feature space. This process effectively compresses the base-class feature space, creating buffer space for novel classes and alleviating the model's confusion between base and novel classes. Extensive experiments on benchmark datasets, including CIFAR-100, miniImageNet, and CUB-200-2011, demonstrate that our method achieves state-of-the-art performance.
List of keywords
Machine Learning -> ML: Incremental learning
Machine Learning -> ML: Few-shot learning
993
MetaISP: Efficient RAW-to-sRGB Mappings with Merely 1M Parameters
Zigeng Chen, Chaowei Liu, Yuan Yuan, Michael Bi Mi, Xinchao Wang
6 min. talk | August 9th at 11:30 | Session: CV: Computer Vision (1/2)
[+] More
[-] Less
State-of-the-art deep ISP models alleviate the dilemma of limited generalization capabilities across heterogeneous inputs by increasing the size and complexity of the network, which inevitably leads to considerable growth in parameter counts and FLOPs. To address this challenge, this paper presents MetaISP, a streamlined model that achieves superior reconstruction quality by adaptively modulating its parameters and architecture in response to diverse inputs. Our rationale revolves around obtaining spatial and channel-wise correction matrices for various inputs within distinct feature spaces, which assists in assigning optimal attention. This is achieved by predicting dynamic weights for each input image and combining these weights with multiple learnable basis matrices to construct the correction matrices. The proposed MetaISP achieves the best performance while remaining computationally efficient. SOTA results are achieved on two large-scale datasets, e.g., 23.80 dB PSNR on ZRR, exceeding the previous SOTA by 0.19 dB with only 9.2% of its parameter count and 10.6% of its FLOPs, and 25.06 dB PSNR on MAI21, exceeding the previous SOTA by 0.17 dB with only 0.9% of its parameter count and 2.7% of its FLOPs.
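A minimal sketch of the dynamic-weight mechanism described above: a small predictor maps the input to mixing weights over learnable basis matrices, which are combined into a per-input channel correction. Names, shapes, and the residual form are assumptions, as the abstract does not give MetaISP's exact formulation.

    import torch
    import torch.nn as nn

    class DynamicCorrection(nn.Module):
        """Predict K mixing weights from the input, combine K learnable basis
        matrices into a per-input channel-correction matrix, and apply it."""
        def __init__(self, channels, n_basis=4):
            super().__init__()
            self.basis = nn.Parameter(torch.randn(n_basis, channels, channels) * 0.01)
            self.weight_pred = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(channels, n_basis), nn.Softmax(dim=1))

        def forward(self, x):                                 # x: (B, C, H, W)
            b, c, h, w = x.shape
            weights = self.weight_pred(x)                     # (B, K)
            corr = torch.einsum('bk,kij->bij', weights, self.basis)  # (B, C, C)
            corr = corr + torch.eye(c, device=x.device)       # residual correction
            return torch.bmm(corr, x.reshape(b, c, h * w)).reshape(b, c, h, w)

    print(DynamicCorrection(8)(torch.randn(2, 8, 16, 16)).shape)  # (2, 8, 16, 16)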
List of keywords
Computer Vision -> CV: Computational photography
Computer Vision -> CV: Applications
996
Tolerating Outliers: Gradient-Based Penalties for Byzantine Robustness and Inclusion
Latifa Errami, El Houcine Bergou
12 min. talk | August 9th at 10:00 | Session: ML: Robustness
[+] More
[-] Less
This work investigates the interplay between Robustness and Inclusion in the context of poisoning attacks targeting the convergence of Stochastic Gradient Descent (SGD). While robustness has received significant attention, the standard Byzantine defenses rely on the Independent and Identically Distributed (IID) assumption causing their performance to deteriorate on non-IID data distributions, even without any attack. This is largely due to these defenses being excessively cautious and discarding benign outliers. We introduce a penalty-based aggregation that accounts for the discrepancy between trusted clients and outliers. We propose the use of Linear Scalarization (LS) as an enhancing method to enable current defenses to simultaneously circumvent Byzantine attacks while also granting inclusion of outliers. This empowers existing defenses to not only counteract malicious adversaries effectively but also to incorporate outliers into the learning process. We conduct a theoretical analysis to demonstrate the convergence of our approach. Specifically, we establish the robustness and resilience of our method under standard assumptions. Empirical analysis further validates the viability of the proposed approach. Across mild to strong non-IID data splits, our method consistently either matches or surpasses the performance of current approaches in the literature, under state-of-the-art Byzantine attack scenarios.
List of keywords
Machine Learning -> ML: Robustness
AI Ethics, Trust, Fairness -> ETF: Fairness and diversity
Machine Learning -> ML: Trustworthy machine learning
998
LSPAN: Spectrally Localized Augmentation for Graph Consistency Learning
Heng-Kai Zhang, Yi-Ge Zhang, Zhi Zhou, Yu-Feng Li
6 min. talk | August 8th at 11:30 | Session: ML: Semi-supervised learning
[+] More
[-] Less
The graph-based consistency principle has been successfully applied to many semi-supervised problems in machine learning. Its performance largely depends on the quality of the augmented graphs, and it has recently been shown that revealing graph properties and maintaining the invariance of graphs are crucial for good performance. However, existing topology- or feature-based augmentation methods are spectrally non-localized: important spectral components are disturbed throughout the entire frequency range, and their invariance may not be well preserved. Efforts on this issue remain limited. This paper proposes a simple yet effective model called Localized SPectral AugmentatioN (LSPAN), which perturbs a concentrated part of the graph spectrum with equivalent intensity using Fourier orthogonality, so as to enhance graph spectrum preservation as well as model prediction. Moreover, it avoids the significant training time of the inverse Fourier transform. Extensive empirical evaluation on real-world datasets clearly shows the performance gain of spectrally localized augmentation, as well as its good convergence and efficiency compared to existing graph methods.
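A minimal sketch of a spectrally localized graph augmentation: perturb only a narrow band of Laplacian eigenvalues and rebuild the matrix, leaving the rest of the frequency range untouched. The band selection and noise form are assumptions; the abstract does not give LSPAN's exact operator.

    import numpy as np

    def localized_spectral_augment(adj, band=(0.2, 0.5), strength=0.1, seed=0):
        """Perturb only the Laplacian eigenvalues whose index falls inside
        `band` (as a fraction of the spectrum)."""
        rng = np.random.default_rng(seed)
        lap = np.diag(adj.sum(1)) - adj
        evals, evecs = np.linalg.eigh(lap)        # lap = U diag(evals) U^T
        n = len(evals)
        lo, hi = int(band[0] * n), int(band[1] * n)
        evals[lo:hi] += strength * rng.standard_normal(hi - lo)  # localized noise
        return evecs @ np.diag(evals) @ evecs.T   # perturbed Laplacian

    adj = np.array([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
    print(localized_spectral_augment(adj).round(2))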
List of keywords
Machine Learning -> ML: Semi-supervised learning
Machine Learning -> ML: Active learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Multi-task and transfer learning
1018
Scaling Up Unbiased Search-based Symbolic Regression
Paul Kahlmeyer, Joachim Giesen, Michael Habeck, Henrik Voigt
12 min. talk | August 8th at 15:00 | Session: ML: Machine Learning (6/6)
[+] More
[-] Less
In a regression task, a function is learned from labeled data to predict the labels at new data points. The goal is to achieve small prediction errors. In symbolic regression, the goal is more ambitious, namely, to learn an interpretable function that makes small prediction errors. This additional goal largely rules out the standard approach used in regression, that is, reducing the learning problem to learning parameters of an expansion of basis functions by optimization. Instead, symbolic regression methods search for a good solution in a space of symbolic expressions. To cope with the typically vast search space, most symbolic regression methods make implicit, or sometimes even explicit, assumptions about its structure. Here, we argue that the only obvious structure of the search space is that it contains small expressions, that is, expressions that can be decomposed into a few subexpressions. We show that systematically searching spaces of small expressions finds solutions that are more accurate and more robust against noise than those obtained by state-of-the-art symbolic regression methods. In particular, systematic search outperforms state-of-the-art symbolic regressors in terms of its ability to recover the true underlying symbolic expressions on established benchmark data sets.
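A toy illustration of systematically enumerating small expressions (the grammar and scoring below are assumptions, far simpler than the paper's method): generate all expression trees up to a few operators and keep the one with the lowest error on the data.

    import itertools
    import numpy as np

    X = np.linspace(-2, 2, 50)
    y = X * X + np.sin(X)                 # target to recover

    UNARY = {'sin': np.sin, 'cos': np.cos}
    BINARY = {'+': np.add, '*': np.multiply}

    def expressions(size):
        """Yield (name, values) for all expression trees with `size` operators."""
        if size == 0:
            yield 'x', X
            return
        for k in range(size):
            for (ln, lv), (rn, rv) in itertools.product(
                    expressions(k), expressions(size - 1 - k)):
                for op, fn in BINARY.items():
                    yield f'({ln} {op} {rn})', fn(lv, rv)
        for name, val in expressions(size - 1):
            for op, fn in UNARY.items():
                yield f'{op}({name})', fn(val)

    best = min((e for s in range(4) for e in expressions(s)),
               key=lambda e: float(np.mean((e[1] - y) ** 2)))
    print(best[0])  # recovers the target, e.g. ((x * x) + sin(x))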
List of keywords
Machine Learning -> ML: Symbolic methods
Machine Learning -> ML: Explainable/Interpretable machine learning
Machine Learning -> ML: Regression
Search -> S: Search and machine learning
1019
Bridging LiDAR Gaps: A Multi-LiDARs Domain Adaptation Dataset for 3D Semantic Segmentation
Shaoyang Chen, Bochun Yang, Yan Xia, Ming Cheng, Siqi Shen, Cheng Wang
12 min. talk | August 6th at 15:00 | Session: CV: 3D computer vision (2/2)
[+] More
[-] Less
We focus on the domain adaptation problem for 3D semantic segmentation, addressing the challenge of data variability in point clouds collected by different LiDARs. Existing benchmarks often mix different types of datasets, which blurs and complicates segmentation evaluations. Here, we introduce a Multi-LiDARs Domain Adaptation Segmentation (MLDAS) dataset, which contains point-wise semantically annotated point clouds captured simultaneously by a 128-beam LiDAR, a 64-beam LiDAR, and a 32-beam LiDAR. We select 31,875 scans from two representative scenarios: campus and urban street. Furthermore, we evaluate current 3D segmentation unsupervised domain adaptation methods on the proposed dataset and propose the Hierarchical Segmentation Network with Spatial Consistency (HSSC) as a novel knowledge transfer method that significantly mitigates the domain gap using spatial-temporal consistency constraints. Extensive experiments show that HSSC greatly improves the state-of-the-art cross-domain semantic segmentation methods. Our project is available at https://sychen320.github.io/projects/MLDAS.
List of keywords
Computer Vision -> CV: 3D computer vision
1025
A De-singularity Subgradient Approach for the Extended Weber Location Problem
Zhao-Rong Lai, Xiaotian Wu, Liangda Fang, Ziliang Chen
6 min. talk | August 9th at 10:00 | Session: ML: Optimization
[+] More
[-] Less
The extended Weber location problem is a classical optimization problem that has recently inspired new work in several machine learning scenarios. However, most existing algorithms may get stuck due to the singularity at the data points when the power of the cost function satisfies 1 ≤ q < 2, such as the widely-used iterative Weiszfeld approach. In this paper, we establish a de-singularity subgradient approach for this problem. We also provide a complete proof of convergence, which fixes some incomplete statements in the proofs of some previous Weiszfeld algorithms. Moreover, we deduce a new theoretical result of superlinear convergence for the iteration sequence in a special case where the minimum point is a singular point. We conduct extensive experiments in a real-world machine learning scenario to show that the proposed approach solves the singularity problem, produces the same results as in the non-singularity cases, and shows a reasonable rate of linear convergence. The results also indicate that the q-th power case (1 < q < 2) is more advantageous than the first-power and second-power cases in some situations. Hence the de-singularity subgradient approach is beneficial to advancing both theory and practice for the extended Weber location problem.
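For reference, the classical Weiszfeld iteration that the paper improves on, shown for q = 1 (the geometric median); the naive update is undefined whenever the iterate lands exactly on a data point, which is precisely the singularity the de-singularity subgradient addresses. The subgradient step itself is not reproduced here.

    import numpy as np

    def weiszfeld(points, iters=100, eps=1e-12):
        """Naive Weiszfeld iteration for the geometric median (q = 1):
        x <- sum_i(a_i / d_i) / sum_i(1 / d_i), with d_i = ||x - a_i||.
        Breaks down when x hits a data point (d_i = 0), hence the guard."""
        x = points.mean(0)                    # start from the centroid
        for _ in range(iters):
            d = np.linalg.norm(points - x, axis=1)
            if np.any(d < eps):               # singular point reached: stop
                break
            w = 1.0 / d
            x = (w[:, None] * points).sum(0) / w.sum()
        return x

    pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
    print(weiszfeld(pts))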
List of keywords
Machine Learning -> ML: Optimization
Constraint Satisfaction and Optimization -> CSO: Solvers and tools
Constraint Satisfaction and Optimization -> CSO: Other
1036
LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs
Taeho Kim, Yanming Wang, Vatshank Chaturvedi, Lokesh Gupta, Seyeon Kim, Yongin Kwon, Sangtae Ha
12 min. talk | August 9th at 10:00 | Session: NLP: Resources and evaluation
[+] More
[-] Less
Fine-tuning pre-trained large language models (LLMs) with limited hardware presents challenges due to GPU memory constraints. Various distributed fine-tuning methods have been proposed to alleviate memory constraints on GPU. However, determining the most effective method for achieving rapid fine-tuning while preventing GPU out-of-memory issues in a given environment remains unclear. To address this challenge, we introduce LLMem, a solution that estimates the GPU memory consumption when applying distributed fine-tuning methods across multiple GPUs and identifies the optimal method. We conduct GPU memory usage estimation prior to fine-tuning, leveraging the fundamental structure of transformer-based decoder models and the memory usage distribution of each method. Experimental results show that LLMem accurately estimates peak GPU memory usage on a single GPU, with an error rate of up to 1.6%. Additionally, it shows an average error rate of 3.0% when applying distributed fine-tuning methods to LLMs with more than a billion parameters on multi-GPU setups.
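For intuition about the quantity LLMem estimates, here is a back-of-the-envelope calculation for full fine-tuning with Adam; the constants are standard rules of thumb (fp16 weights and gradients, fp32 master weights, and two fp32 optimizer states), not LLMem's actual model, which also accounts for activations and the distribution strategy.

    def finetune_memory_gib(n_params, dtype_bytes=2, optimizer_states=2,
                            master_weights=True):
        """Rough per-GPU memory for full fine-tuning with Adam: weights and
        gradients in training dtype, plus fp32 master weights and two fp32
        optimizer states (momentum and variance)."""
        weights = n_params * dtype_bytes
        grads = n_params * dtype_bytes
        master = n_params * 4 if master_weights else 0
        opt = n_params * 4 * optimizer_states
        return (weights + grads + master + opt) / 1024**3

    # A hypothetical 1.3B-parameter model in fp16 with Adam:
    print(f"{finetune_memory_gib(1.3e9):.1f} GiB")  # ~19.4 GiB before activations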
List of keywords
Natural Language Processing -> NLP: Language models
Machine Learning -> ML: Deep learning architectures
1039
Markov Constraint as Large Language Model Surrogate
Alexandre Bonlarron, Jean-Charles Régin
6 min. talk | August 8th at 10:00 | Session: CSO: Constraint Satisfaction and Optimization
[+] More
[-] Less
This paper presents NgramMarkov, a variant of the Markov constraints, dedicated to text generation in constraint programming (CP). It involves a set of n-grams (i.e., sequences of n words) associated with probabilities given by a large language model (LLM), and it limits the product of the probabilities of the n-grams of a sentence. The propagator of this constraint can be seen as an extension of the ElementaryMarkov constraint propagator, incorporating the LLM distribution instead of the maximum likelihood estimation of n-grams. It uses a gliding threshold, i.e., it rejects n-grams whose local probabilities are too low, to guarantee balanced solutions. It can also be combined with a "look-ahead" approach to remove n-grams that are very unlikely to lead to acceptable sentences for a fixed-length horizon. This idea is based on the MDDMarkovProcess constraint propagator, but without explicitly using an MDD (Multi-Valued Decision Diagram). The experimental results show that the generated text is valued similarly to the LLM perplexity function. Using this new constraint dramatically reduces the number of candidate sentences produced, improves computation times, and allows larger corpora or smaller n-grams to be used. A real-world problem has been solved for the first time using 4-grams instead of 5-grams.
List of keywords
Constraint Satisfaction and Optimization -> CSO: Constraint programming
Constraint Satisfaction and Optimization -> CSO: Applications
Constraint Satisfaction and Optimization -> CSO: Modeling
Natural Language Processing -> NLP: Language generation
1042
Temporal Inductive Logic Reasoning over Hypergraphs
Yuan Yang, Siheng Xiong, Ali Payani, James C. Kerce, Faramarz Fekri
6 min. talk | August 7th at 11:30 | Session: KRR: Logic programming
Inductive logic reasoning is a fundamental task in graph analysis that aims to generalize patterns from data. It has been extensively studied for traditional graph representations, such as knowledge graphs (KGs), using techniques like inductive logic programming (ILP). Existing ILP methods assume learning from KGs with static facts and binary relations. Beyond KGs, graph structures are widely present in other applications such as procedural instructions, scene graphs, and program executions. While ILP would benefit these applications, applying it to such graphs is nontrivial: they are more complex than KGs, typically involving timestamps and n-ary relations, and are effectively hypergraphs with temporal events. In this work, we propose temporal inductive logic reasoning (TILR), an ILP method that reasons over temporal hypergraphs. To enable hypergraph reasoning, we introduce the multi-start random B-walk, a novel graph traversal method for hypergraphs. By combining it with a path-consistency algorithm, TILR learns logic rules by generalizing from both temporal and relational data. To address the lack of hypergraph benchmarks, we create and release two temporal hypergraph datasets: YouCook2-HG and nuScenes-HG. Experiments on these benchmarks demonstrate that TILR achieves superior reasoning capability over various strong baselines.
List of keywords
Knowledge Representation and Reasoning -> KRR: Logic programming
Data Mining -> DM: Knowledge graphs and knowledge base completion
1043
Imperfect-Recall Games: Equilibrium Concepts and Their Complexity
Emanuel Tewolde, Brian Hu Zhang, Caspar Oesterheld, Manolis Zampetakis, Tuomas Sandholm, Paul Goldberg, Vincent Conitzer
6 min. talk | August 8th at 15:00 | Session: GTEP: Game Theory and Economic Paradigms
We investigate optimal decision making under imperfect recall, that is, when an agent forgets information it once held. Examples include the absentminded driver game and team games in which the members have limited communication capabilities. In the framework of extensive-form games with imperfect recall, we analyze the computational complexity of finding equilibria in multiplayer settings across three solution concepts: Nash, multiselves based on evidential decision theory (EDT), and multiselves based on causal decision theory (CDT). We are interested in both exact and approximate solution computation. As special cases, we consider (1) single-player games, (2) two-player zero-sum games and relationships to maximin values, and (3) games without exogenous stochasticity (chance nodes). We relate these problems to the complexity classes PPAD, PLS, Σ_2^P, ∃R, and ∃∀R.
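For a concrete feel of imperfect recall, the classic absentminded driver game (Piccione and Rubinstein) mentioned in the abstract can be solved numerically in a few lines; the 0/4/1 payoffs are the standard textbook values:

    import numpy as np

    # Absentminded driver: the driver cannot tell the first exit from the second.
    # Payoffs: exit at the first exit = 0, exit at the second = 4, drive past both = 1.
    # Imperfect recall forces one continue-probability p at both nodes, so
    #   U(p) = (1 - p) * 0 + p * (1 - p) * 4 + p**2 * 1
    p = np.linspace(0, 1, 10001)
    U = 4 * p * (1 - p) + p**2
    best = p[np.argmax(U)]
    print(f"optimal p = {best:.3f}, U = {U.max():.3f}")   # p* = 2/3, U* = 4/3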
List of keywords
Game Theory and Economic Paradigms -> GTEP: Noncooperative games
1062
Self-supervised Weighted Information Bottleneck for Multi-view Clustering
Zhengzheng Lou, Chaoyang Zhang, Hang Xue, Yangdong Ye, Qinglei Zhou, Shizhe Hu
6 min. talk | August 6th at 15:00 | Session: ML: Multi-view learning
Multi-view clustering (MVC) is a long-standing topic in the machine learning and data mining communities, focusing on investigating and utilizing the relationships among views to discover a consistent final cluster structure. Weighted MVC is a popular family of methods that learns and applies a weight/importance for each view to fully explore the complementary information across views. However, most existing weighted MVC methods only consider the quality of each view, ignoring the vital role of pseudo-label self-supervision information in weight learning. In this work, we propose a novel self-supervised weighted information bottleneck (SWIB) method for the multi-view clustering problem. It combines the weighted information from different views based on information bottleneck theory, with a newly designed view-weight learning mechanism that simultaneously takes into account both the quality of the information contained in each view and the self-supervised information from each view's data partition. Experimental results on multi-view text, multi-feature image, multi-angle video, and multi-modal text-image datasets, as well as large-scale datasets, show the superiority of the SWIB method. To our knowledge, this is the first work to incorporate self-supervised learning into weighted multi-view clustering.
List of keywords
Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering
Machine Learning -> ML: Multi-modal learning
Machine Learning -> ML: Unsupervised learning
1074
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning
Zheqi He, Xinya Wu, Pengfei Zhou, Richeng Xuan, Guang Liu, Xi Yang, Qiannan Zhu, Hua Huang
6 min. talk | August 7th at 10:00 | Session: CV: Multimodal learning
Multi-modal large language models (MLLMs) have achieved remarkable progress and demonstrated powerful knowledge comprehension and reasoning abilities. However, the mastery of domain-specific knowledge, which is essential for evaluating the intelligence of MLLMs, continues to be a challenge. Current multi-modal benchmarks for domain-specific knowledge concentrate on multiple-choice questions and are predominantly available in English, which limits the comprehensiveness of the evaluation. To this end, we introduce CMMU, a novel benchmark for multi-modal and multi-type question understanding and reasoning in Chinese. CMMU consists of 3,603 questions in 7 subjects, covering knowledge from primary to high school. The questions can be categorized into 3 types: multiple-choice, multiple-response, and fill-in-the-blank, bringing greater challenges to MLLMs. In addition, we propose an evaluation strategy called Positional Error Variance for assessing multiple-choice questions, which performs a quantitative analysis of position bias. We evaluate seven open-source MLLMs along with GPT-4V, Gemini-Pro, and Qwen-VL-Plus. The results demonstrate that CMMU poses a significant challenge to recent MLLMs. The data and code are available at https://github.com/FlagOpen/CMMU.
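One plausible reading of the Positional Error Variance idea (the paper's exact definition may differ): rotate the correct option through the answer positions and measure how accuracy varies with position:

    import statistics

    def positional_error_variance(acc_by_position):
        """Evaluate each multiple-choice question several times with the correct
        option rotated through positions A-D, then measure how accuracy varies
        with the answer's position. High variance = strong position bias.
        Illustrative interpretation only."""
        return statistics.pvariance(acc_by_position)

    print(positional_error_variance([0.62, 0.55, 0.48, 0.41]))  # biased toward early slots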
List of keywords
Computer Vision -> CV: Multimodal learning
Multidisciplinary Topics and Applications -> MTA: Education
1086
IMM: An Imitative Reinforcement Learning Approach with Predictive Representation Learning for Automatic Market Making
Hui Niu, Siyuan Li, Jiahao Zheng, Zhouchi Lin, Bo An, Jian Li, Jian Guo
12 min. talk | August 7th at 10:00 | Session: MTA: Finance
Market making (MM) via reinforcement learning (RL) has attracted significant attention in financial trading. Most existing RL-based MM methods focus on optimizing single-price-level strategies, which suffer from frequent order cancellations and loss of queue priority. Strategies involving multiple price levels align better with actual trading scenarios, but the comprehensive trading action space that multi-price-level RL strategies entail makes effective training challenging. Inspired by the workflow of professional human market makers, we propose Imitative Market Maker (IMM), a novel RL framework that leverages knowledge from both suboptimal signal-based experts and direct policy interactions. Our framework starts by introducing effective state and action formulations that encode information about multi-price-level orders. Furthermore, IMM integrates a representation learning unit capable of capturing both short- and long-term market trends to mitigate adverse selection risk. Subsequently, IMM designs an expert strategy based on predictive signals and trains the agent through the integration of RL and imitation learning techniques to achieve efficient learning. Extensive experimental results on four real-world market datasets demonstrate the superiority of IMM over current RL-based MM strategies.
List of keywords
Multidisciplinary Topics and Applications -> MTA: Finance
Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Representation learning
1093
Linear-Time Optimal Deadlock Detection for Efficient Scheduling in Multi-Track Railway Networks
Hastyn Doshi, Ayush Tripathi, Keshav Agarwal, Harshad Khadilkar, Shivaram Kalyanakrishnan
6 min. talk | August 6th at 15:00 | Session: MTA: Transportation
The railway scheduling problem requires the computation of an operable timetable that satisfies constraints involving railway infrastructure and resource occupancy times, while minimising average delay over a set of events. Since this problem is computationally hard, practical solutions typically roll out feasible (but suboptimal) schedules one step at a time, by choosing which train to move next in every step. The choices made by such algorithms are necessarily myopic, and incur the risk of driving the system to a deadlock. To escape deadlocks, the predominant approach is to stay away from states flagged as potentially unsafe by some fast-to-compute rule R. While many choices of R guarantee deadlock avoidance, they are suboptimal in the sense of also flagging some safe states as unsafe. In this paper, we revisit the literature on process scheduling and describe a rule R0 that (i) is necessary and sufficient for deadlock detection when the network has at least two tracks in each resource (station / track section), (ii) is computable in linear time, and (iii) yields lower delays when combined with existing scheduling algorithms on both synthetic and real data sets from Indian Railways.
List of keywords
Multidisciplinary Topics and Applications -> MTA: Transportation
Planning and Scheduling -> PS: Applications
Planning and Scheduling -> PS: Markov decisions processes
Planning and Scheduling -> PS: Scheduling
1095
A Lightweight U-like Network Utilizing Neural Memory Ordinary Differential Equations for Slimming the Decoder
Quansong He, Xiaojun Yao, Jun Wu, Zhang Yi, Tao He
6 min. talk | August 8th at 10:00 | Session: CV: Segmentation
In recent years, advanced U-like networks have demonstrated remarkable performance in medical image segmentation tasks. However, their drawbacks, including excessive parameters, high computational complexity, and slow inference speed, pose challenges for practical implementation in scenarios with limited computational resources. Existing lightweight U-like networks alleviate some of these problems, but they often have pre-designed structures and consist of non-detachable modules, limiting their application scenarios. In this paper, we propose three plug-and-play decoders based on different discretization methods of the neural memory ordinary differential equation (nmODE). These decoders integrate features at various levels of abstraction by processing information from skip connections and performing numerical operations on upward paths. Through experiments on the PH2, ISIC2017, and ISIC2018 datasets, we embed these decoders into different U-like networks, demonstrating their effectiveness in significantly reducing the number of parameters and computation while maintaining performance. In summary, the proposed discretized nmODE decoders reduce the number of parameters by about 20% ~ 50% and computation by up to 74%, while being adaptable to all U-like networks. Our code is available at https://github.com/nayutayuki/Lightweight-nmODE-Decoders-For-U-like-networks.
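As a sketch of what a discretized nmODE stage might compute, assuming the commonly cited nmODE form y' = -y + sin^2(y + gamma) (stated from memory; verify against the paper) and a simple explicit-Euler discretization:

    import numpy as np

    def nmode_euler_step(y, gamma, h=0.1):
        """One explicit-Euler step of the assumed nmODE y' = -y + sin^2(y + gamma).
        A decoder stage would apply a few such steps to skip-connection features
        gamma, so the 'layer' has no weight matrices of its own."""
        return y + h * (-y + np.sin(y + gamma) ** 2)

    y = np.zeros(8)
    gamma = np.random.randn(8)           # stands in for encoder/skip-connection features
    for _ in range(20):
        y = nmode_euler_step(y, gamma)
    print(y)                             # moves toward a gamma-dependent fixed point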
List of keywords
Computer Vision -> CV: Segmentation
Computer Vision -> CV: Biomedical image analysis
Computer Vision -> CV: Machine learning for vision
Machine Learning -> ML: Convolutional networks
1098
Common-Individual Semantic Fusion for Multi-View Multi-Label Learning
Gengyu Lyu, Weiqi Kang, Haobo Wang, Zheng Li, Zhen Yang, Songhe Feng
6 min. talk | August 9th at 11:30 | Session: ML: Multi-label learning
In multi-view multi-label learning, each instance is described by several heterogeneous features and is associated with multiple valid labels simultaneously. Existing methods mainly focus on feature-level view fusion to capture a common representation for multi-label classifier induction. In this paper, we take a new perspective and propose a semantic-level fusion model named Common-Individual Semantic Fusion Multi-View Multi-Label Learning (CISF). Unlike previous feature-level fusion models, our method focuses directly on semantic-level view fusion and simultaneously takes into consideration both the common semantics shared across views and the individual semantics of each specific view. Specifically, we assume each view involves some common semantic labels while owning a few exclusive semantic labels. The common and exclusive semantic labels are then forced, respectively, to be consistent and diverse, so as to exploit the consistency and complementarity among different views. Afterwards, we introduce low-rank and sparse constraints to highlight the label co-occurrence relationship of common semantics and the view-specific expression of individual semantics. We provide a theoretical guarantee of the strict convexity of our method under proper parameter settings. Extensive experiments on various datasets have verified the superiority of our method.
List of keywords
Machine Learning -> ML: Multi-label learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Weakly supervised learning
1106
Multi-Attention Based Visual-Semantic Interaction for Few-Shot Learning
Peng Zhao, Yin Wang, Wei Wang, Jie Mu, Huiting Liu, Cong Wang, Xiaochun Cao
6 min. talk | August 6th at 15:00 | Session: CV: Transfer, low-shot, semi- and un- supervised learning   
Few-shot learning (FSL) aims to train a model that can generalize to recognize new classes, each with only very limited training samples. Since extracting discriminative features for new classes from few samples is challenging, existing FSL methods leverage visual and semantic prior knowledge to guide discriminative feature learning. However, for meta-learning purposes the semantic knowledge of the query set is unavailable, so query features lack discriminability. To address this problem, we propose a novel Multi-Attention based Visual-Semantic Interaction (MAVSI) approach for FSL. Specifically, we utilize spatial and channel attention mechanisms to select discriminative visual features for the support set based on its ground-truth semantics, while using all the support-set semantics for each query-set sample. Then, a relation module with class prototypes of the support set is employed to supervise and select discriminative visual features for the query set. To further enhance the discriminability of the support set, we introduce a visual-semantic contrastive learning module to promote the similarity between visual features and their corresponding semantic features. Extensive experiments on four benchmark datasets demonstrate that MAVSI outperforms existing state-of-the-art FSL methods.
List of keywords
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
Machine Learning -> ML: Meta-learning
1114
Large Language Models Are Not Strong Abstract Reasoners
Gaël Gendron, Qiming Bao, Michael Witbrock, Gillian Dobbie
6 min. talk | August 9th at 11:30 | Session: NLP: Language models
Large language models have shown tremendous performance on a large variety of natural language processing tasks, ranging from text comprehension to common sense reasoning. However, the mechanisms responsible for this success remain opaque, and it is unclear whether LLMs can achieve human-like cognitive capabilities or whether these models are still fundamentally circumscribed. Abstract reasoning is a fundamental task for cognition, consisting of finding and applying a general pattern from limited data. Evaluating deep neural architectures on this task could give insight into their potential limitations regarding reasoning and their broad generalisation abilities, yet this is currently an under-explored area. In this paper, we introduce a new benchmark for evaluating language models beyond memorization on abstract reasoning tasks. We perform extensive evaluations of state-of-the-art LLMs, showing that they currently achieve very limited performance in contrast with other natural language tasks, even when applying techniques that have been shown to improve performance on other NLP tasks. We argue that guiding LLM generation to follow causal paths could help improve the generalisation and reasoning abilities of LLMs.
List of keywords
Natural Language Processing -> NLP: Language models
Machine Learning -> ML: Evaluation
Machine Learning -> ML: Robustness
Natural Language Processing -> NLP: Question answering
1120
Toward a Manifold-Preserving Temporal Graph Network in Hyperbolic Space
Viet Quan Le, Viet Cuong Ta
12 min. talk | August 8th at 15:00 | Session: ML: Sequence and graph learning
Hyperbolic geometry provides an ideal setting for naturally representing the scale-free or hierarchical characteristics of an input graph. Utilizing hyperbolic geometry to learn dynamic graph representations has gained growing interest in recent years. However, the majority of hyperbolic approaches rely on tangent spaces to perform graph operations, which can distort the structure of a dynamic graph as it grows over time. To avoid this tangent-space distortion, we propose a Hyperbolic Manifold-Preserving Temporal Graph Network that works directly on the hyperbolic manifold. The model includes a graph convolution module for learning spatial dependencies, an attention architecture for capturing temporal properties, and a gated recurrent unit for extracting spatio-temporal relationships. Evaluated on diverse real-world dynamic graphs, our model achieves significant improvements in link prediction and new-link prediction tasks in comparison with other baselines.
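For reference, the manifold quantity such a model works with directly, rather than via tangent-space round-trips, is the Poincare-ball geodesic distance:

    import numpy as np

    def poincare_dist(x, y, eps=1e-9):
        """Geodesic distance on the Poincare ball (curvature -1), a standard model
        of hyperbolic space: d(x, y) = arccosh(1 + 2||x-y||^2 / ((1-||x||^2)(1-||y||^2))).
        Inputs must lie strictly inside the unit ball."""
        sq = np.sum((x - y) ** 2)
        denom = (1 - np.sum(x**2)) * (1 - np.sum(y**2))
        return np.arccosh(1 + 2 * sq / max(denom, eps))

    print(poincare_dist(np.array([0.1, 0.0]), np.array([0.0, 0.8])))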
List of keywords
Machine Learning -> ML: Sequence and graph learning
Machine Learning -> ML: Geometric learning
Data Mining -> DM: Mining graphs
Data Mining -> DM: Mining spatial and/or temporal data
1125
Continual Compositional Zero-Shot Learning
Yang Zhang, Songhe Feng, Jiazheng Yuan
6 min. talk | August 6th at 15:00 | Session: CV: Transfer, low-shot, semi- and un- supervised learning   
Compositional zero-shot learning (CZSL) aims to recognize unseen compositions using knowledge learned from seen compositions, where each composition consists of two primitives (attribute and object). However, existing CZSL methods are designed to learn compositions from a fixed primitive set and cannot handle the continually expanding primitive sets of real-world applications. In this paper, we propose a new CZSL setting, named continual compositional zero-shot learning (CCZSL), which requires the model to recognize unseen compositions composed of learned primitives while the learned primitive set continually grows. Contextuality and catastrophic forgetting are the main issues to be addressed in this setting. Specifically, we capture similar contextuality in compositions through several learnable Super-Primitives that modify the invariant primitive embedding to better adapt to the contextuality of the corresponding composition. We then introduce a dual knowledge distillation loss that maintains knowledge learned in previous sessions while avoiding overfitting to the new session. We design the CCZSL evaluation protocol and conduct extensive experiments on widely used benchmarks, demonstrating the superiority of our method compared to state-of-the-art CZSL methods.
List of keywords
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
Machine Learning -> ML: Incremental learning
1129
Causality-enhanced Discreted Physics-informed Neural Networks for Predicting Evolutionary Equations
Ye Li, Siqi Chen, Bin Shan, Sheng-Jun Huang
6 min. talk | August 6th at 11:30 | Session: ML: Applications
Physics-informed neural networks (PINNs) have shown promising potential for solving partial differential equations (PDEs) using deep learning. However, PINNs face training difficulties for evolutionary PDEs, particularly for dynamical systems whose solutions exhibit multi-scale or turbulent behavior over time. The reason is that PINNs may violate the temporal causality property since all the temporal features in the PINNs loss are trained simultaneously. This paper proposes to use implicit time differencing schemes to enforce temporal causality, and use transfer learning to sequentially update the PINNs in space as surrogates for PDE solutions in different time frames. The evolving PINNs are better able to capture the varying complexities of the evolutionary equations, while only requiring minor updates between adjacent time frames. Our method is theoretically proven to be convergent if the time step is small and each PINN in different time frames is well-trained. In addition, we provide state-of-the-art (SOTA) numerical results for a variety of benchmarks for which existing PINNs formulations may fail or be inefficient. We demonstrate that the proposed method improves the accuracy of PINNs approximation for evolutionary PDEs and improves efficiency by a factor of 4–40x. The code is available at https://github.com/SiqiChen9/TL-DPINNs.
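As a generic illustration (not necessarily the paper's exact scheme), a backward-Euler discretization of an evolutionary PDE u_t = N[u] turns each time frame into its own PINN training target, so information flows only forward in time:

    % Backward-Euler (implicit) residual for u_t = N[u], frame n -> n+1:
    \[
        \mathcal{L}_{n+1}(\theta) \;=\; \frac{1}{M}\sum_{j=1}^{M}
        \left\| \, u_\theta(x_j) \;-\; u^{n}(x_j) \;-\; \Delta t\,
        \mathcal{N}\!\left[u_\theta\right](x_j) \right\|^2
    \]
    % u^n is the frozen PINN from the previous frame (also used, via transfer
    % learning, to warm-start u_theta); minimizing L_{n+1} only after frame n
    % is trained is what enforces temporal causality.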
List of keywords
Machine Learning -> ML: Applications
Machine Learning -> ML: Causality
Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Regression
1133
SGDCL: Semantic-Guided Dynamic Correlation Learning for Explainable Autonomous Driving
Chengtai Cao, Xinhong Chen, Jianping Wang, Qun Song, Rui Tan, Yung-Hui Li
6 min. talk | August 9th at 11:30 | Session: CV: Computer Vision (1/2)
By learning expressive representations, deep learning (DL) has revolutionized autonomous driving (AD). Despite significant advancements, the inherent opacity of DL models engenders public distrust, impeding their widespread adoption. For explainable autonomous driving, current studies primarily concentrate on extracting features from input scenes to predict driving actions and their corresponding explanations. However, these methods underutilize semantics and correlation information within actions and explanations (collectively called categories in this work), leading to suboptimal performance. To address this issue, we propose Semantic-Guided Dynamic Correlation Learning (SGDCL), a novel approach that effectively exploits semantic richness and dynamic interactions intrinsic to categories. SGDCL employs a semantic-guided learning module to obtain category-specific representations and a dynamic correlation learning module to adaptively capture intricate correlations among categories. Additionally, we introduce an innovative loss term to leverage fine-grained co-occurrence statistics of categories for refined regularization. We extensively evaluate SGDCL on two well-established benchmarks, demonstrating its superiority over seven state-of-the-art baselines and a large vision-language model. SGDCL significantly promotes explainable autonomous driving with up to 15.3% performance improvement and interpretable attention scores, bolstering public trust in AD.
List of keywords
Computer Vision -> CV: Interpretability and transparency
Machine Learning -> ML: Explainable/Interpretable machine learning
Machine Learning -> ML: Classification
Computer Vision -> CV: Machine learning for vision
1138
Accelerating Diffusion Models for Inverse Problems through Shortcut Sampling
Gongye Liu, Haoze Sun, Jiayi Li, Fei Yin, Yujiu Yang
6 min. talk | August 8th at 15:00 | Session: CV: Image and video synthesis and generation (2/2)
Diffusion models have recently demonstrated an impressive ability to address inverse problems in an unsupervised manner. While existing methods primarily focus on modifying the posterior sampling process, the potential of the forward process remains largely unexplored. In this work, we propose Shortcut Sampling for Diffusion (SSD), a novel approach for solving inverse problems in a zero-shot manner. Instead of initiating from random noise, the core idea of SSD is to find a specific transitional state that bridges the measurement image y and the restored image x. By utilizing the shortcut path "input – transitional state – output", SSD achieves precise restoration with fewer steps. To derive the transitional state during the forward process, we introduce Distortion Adaptive Inversion. Moreover, we apply back projection as an additional consistency constraint during the generation process. Experimentally, we demonstrate SSD's effectiveness on multiple representative image restoration (IR) tasks. Our method achieves competitive results with only 30 NFEs compared to state-of-the-art zero-shot methods (100 NFEs), and outperforms them with 100 NFEs on certain tasks. Code is available at https://github.com/GongyeLiu/SSD.
List of keywords
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Applications
1163
Expressiveness is Effectiveness: Self-supervised Fashion-aware CLIP for Video-to-Shop Retrieval
Likai Tian, Zhengwei Yang, Zechao Hu, Hao Li, Yifang Yin, Zheng Wang
6 min. talk | August 7th at 10:00 | Session: CV: Image and video retrieval 
The rise of online shopping and social media has spurred the video-to-shop retrieval (VSR) task, which involves identifying fashion items (e.g., clothing) in videos and matching them with identical products provided by stores. In real-world scenarios, human movement in dynamic video scenes can cause substantial morphological alterations of fashion items, including occlusion, shifting viewpoints (parallax), and partial visibility (truncation). As a result, the few high-quality frames are overwhelmed by a vast number of redundant ones, making retrieval less effective. To this end, this paper introduces Self-supervised Fashion-aware CLIP (SF-CLIP), a framework for effective VSR. SF-CLIP enables the discovery of salient frames with high fashion expressiveness by generating pseudo-labels that assess the three key aspects of fashion expressiveness: occlusion, parallax, and truncation. With such pseudo-labels, the ability of CLIP is extended to facilitate the discovery of salient frames. Furthermore, to obtain comprehensive representations across salient frames, a dual-branch graph-based fusion module is proposed to extract and integrate inter-frame features. Extensive experiments demonstrate the superiority of SF-CLIP over state-of-the-art methods.
List of keywords
Computer Vision -> CV: Image and video retrieval 
Computer Vision -> CV: Interpretability and transparency
1175
Sub-Adjacent Transformer: Improving Time Series Anomaly Detection with Reconstruction Error from Sub-Adjacent Neighborhoods
Wenzhen Yue, Xianghua Ying, Ruohao Guo, DongDong Chen, Ji Shi, Bowei Xing, Yuqing Zhu, Taiyan Chen
12 min. talk | August 8th at 11:30 | Session: DM: Anomaly/outlier detection
In this paper, we present the Sub-Adjacent Transformer with a novel attention mechanism for unsupervised time series anomaly detection. Unlike previous approaches that rely on all the points within some neighborhood for time point reconstruction, our method restricts the attention to regions not immediately adjacent to the target points, termed sub-adjacent neighborhoods. Our key observation is that owing to the rarity of anomalies, they typically exhibit more pronounced differences from their sub-adjacent neighborhoods than from their immediate vicinities. By focusing the attention on the sub-adjacent areas, we make the reconstruction of anomalies more challenging, thereby enhancing their detectability. Technically, our approach concentrates attention on the non-diagonal areas of the attention matrix by enlarging the corresponding elements in the training stage. To facilitate the implementation of the desired attention matrix pattern, we adopt linear attention because of its flexibility and adaptability. Moreover, a learnable mapping function is proposed to improve the performance of linear attention. Empirically, the Sub-Adjacent Transformer achieves state-of-the-art performance across six real-world anomaly detection benchmarks, covering diverse fields such as server monitoring, space exploration, and water treatment.
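A sketch of the intended sparsity pattern; the paper softens this by enlarging non-diagonal attention weights during training rather than hard masking, and the skip/span values here are made up for illustration:

    import numpy as np

    def sub_adjacent_mask(n, skip=5, span=20):
        """Binary attention mask in the spirit of the Sub-Adjacent Transformer:
        each time point may attend only to neighbors at least `skip` and at most
        `skip + span` steps away, excluding its immediate vicinity (the diagonal
        band)."""
        idx = np.arange(n)
        dist = np.abs(idx[:, None] - idx[None, :])
        return ((dist >= skip) & (dist <= skip + span)).astype(float)

    M = sub_adjacent_mask(100)
    print(M[50, 45:60])   # zeros within the diagonal band, ones in the sub-adjacent band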
List of keywords
Data Mining -> DM: Anomaly/outlier detection
Machine Learning -> ML: Time series and data streams
1177
Meta-Learning via PAC-Bayesian with Data-Dependent Prior: Generalization Bounds from Local Entropy
Shiyu Liu, Wei Shi, Zenglin Xu, Shaogao Lv, Yehong Zhang, Hui Wang
6 min. talk | August 8th at 15:00 | Session: ML: Machine Learning (6/6)
Meta-learning accelerates the learning process on unseen learning tasks by acquiring prior knowledge from previous related tasks. PAC-Bayesian theory provides a theoretical framework for analyzing the generalization of meta-learning to unseen tasks. However, previous works encounter two notable limitations: (1) they merely focus on data-free priors, which often results in inappropriate regularization and loose generalization bounds; (2) more importantly, their optimization process usually involves nested optimization problems, incurring significant computational costs. To address these issues, we derive new generalization bounds and introduce a novel PAC-Bayesian framework for meta-learning that integrates data-dependent priors. This framework enables the extraction of optimal posteriors for each task in closed form, thereby allowing us to minimize generalization bounds that incorporate data-dependent priors with only a simple local-entropy term. The resulting algorithm, which employs SGLD to sample from the optimal posteriors, is stable, efficient, and computationally lightweight, eliminating the need for nested optimization. Extensive experimental results demonstrate that our proposed method outperforms the baselines.
List of keywords
Machine Learning -> ML: Bayesian learning
Machine Learning -> ML: Meta-learning
1185
DTS-TPT: Dual Temporal-Sync Test-time Prompt Tuning for Zero-shot Activity Recognition
Rui Yan, Hongyu Qu, Xiangbo Shu, Wenbin Li, Jinhui Tang, Tieniu Tan
6 min. talk | August 6th at 11:30 | Session: CV: Video analysis and understanding   
Fine-tuning large vision-language models on video data with a set of learnable prompts has shown promising performance on zero-shot activity recognition, but still requires extra video data and incurs expensive training costs. Inspired by recent test-time prompt tuning (TPT) in the image domain, this work extends TPT to video data for zero-shot activity recognition. However, monotonous spatial augmentation and short class names cannot capture the diverse and complicated semantics of human behavior during prompt tuning. To this end, this work proposes a Dual Temporal-Sync Test-time Prompt Tuning (DTS-TPT) framework for zero-shot activity recognition. DTS-TPT tunes the learnable prompts appended to text inputs on video feature sequences of different temporal scales in multiple steps during test time. In each tuning step, we encourage semantic consistency among the predictions from video feature sequences randomly augmented via AugMix, using both the original class names and the corresponding descriptions generated through an LLM. Compared with state-of-the-art methods, the proposed method improves zero-shot top-1 accuracy by approximately 2% ~ 5% on popular benchmarks. The code is available at https://github.com/quhongyu/DTS-TPT.
List of keywords
Computer Vision -> CV: Video analysis and understanding   
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
1194
On the Effects of Fairness to Adversarial Vulnerability
Cuong Tran, Keyu Zhu, Pascal Van Hentenryck, Ferdinando Fioretto
6 min. talk | August 8th at 11:30 | Session: ETF: AI Ethics, Trust, Fairness (2/2)
Fairness and robustness are two important notions of learning models. Fairness ensures that models do not disproportionately harm (or benefit) some groups over others, while robustness measures a model's resilience against small input perturbations. While both properties are equally important, this paper illustrates a dichotomy between fairness and robustness, and analyzes when striving for fairness decreases model robustness to adversarial samples. The reported analysis sheds light on the factors causing such contrasting behavior, suggesting distance to the decision boundary across groups as a key factor. Experiments on non-linear models and different architectures validate the theoretical findings. In addition to the theoretical analysis, the paper also proposes a simple yet effective solution to construct models achieving good tradeoffs between fairness and robustness.
List of keywords
AI Ethics, Trust, Fairness -> ETF: Fairness and diversity
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
1201
Hybrid Frequency Modulation Network for Image Restoration
Yuning Cui, Mingyu Liu, Wenqi Ren, Alois Knoll
6 min. talk | August 9th at 10:00 | Session: CV: Applications
Image restoration involves recovering a high-quality image from its corrupted counterpart. This paper presents an effective and efficient framework for image restoration, termed CSNet, based on "channel + spatial" hybrid frequency modulation. Different feature channels carry different degradation patterns and degrees; however, most current networks ignore the importance of channel interactions. To alleviate this issue, we propose a frequency-based channel feature modulation module that facilitates channel interactions through a channel-dimension Fourier transform. Furthermore, based on our observations, we develop a multi-scale frequency-based spatial feature modulation module to refine the direct-current component of features using extremely lightweight learnable parameters. This module adopts a densely connected coarse-to-fine learning paradigm to enhance multi-scale representation learning. In addition, we introduce a frequency-inspired loss function to achieve omni-frequency learning. Extensive experiments on nine datasets demonstrate that the proposed network achieves state-of-the-art performance on three image restoration tasks: image dehazing, image defocus deblurring, and image desnowing. The code and models are available at https://github.com/c-yn/CSNet.
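A minimal sketch of channel-dimension Fourier modulation, with a stand-in gain vector in place of learned parameters; CSNet's actual module also includes the multi-scale spatial branch and its coarse-to-fine design:

    import numpy as np

    def channel_fft_modulation(feat, gain):
        """FFT along the channel axis, reweight the channel-frequency spectrum,
        then invert. `gain` stands in for learnable parameters."""
        spec = np.fft.fft(feat, axis=0)          # channel-dimension Fourier transform
        spec = spec * gain[:, None, None]        # per-frequency modulation
        return np.fft.ifft(spec, axis=0).real    # back to the feature domain

    C, H, W = 8, 4, 4
    x = np.random.randn(C, H, W)
    g = np.ones(C); g[0] = 1.5                   # e.g., boost the DC (channel-mean) term
    print(channel_fft_modulation(x, g).shape)    # (8, 4, 4)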
List of keywords
Computer Vision -> CV: Applications
Computer Vision -> CV: Computational photography
Computer Vision -> CV: Representation learning
1227
Continual Multi-Objective Reinforcement Learning via Reward Model Rehearsal
Lihe Li, Ruotong Chen, Ziqian Zhang, Zhichao Wu, Yi-Chen Li, Cong Guan, Yang Yu, Lei Yuan
6 min. talk | August 7th at 15:00 | Session: ML: Reinforcement learning (1/2)
Multi-objective reinforcement learning (MORL) approaches address real-world problems with multiple objectives by learning policies maximizing returns weighted by different user preferences. Typical methods assume the objectives remain unchanged throughout the agent’s lifetime. However, in some real-world situations, the agent may encounter dynamically changing learning objectives, i.e., different vector-valued reward functions at different learning stages. This issue has not been considered in problem formulation or algorithm design. To address this issue, we formalize the setting as a continual MORL (CMORL) problem for the first time, accounting for the evolution of objectives throughout the learning process. Subsequently, we propose Continual Multi-Objective Reinforcement Learning via Reward Model Rehearsal (CORe3), incorporating a dynamic agent network for rapid adaptation to new objectives. Moreover, we develop a reward model rehearsal technique to recover the reward signals for previous objectives, thus alleviating catastrophic forgetting. Experiments on four CMORL benchmarks showcase that CORe3 effectively learns policies satisfying different preferences on all encountered objectives, and outperforms the best baseline by 171%, highlighting the capability of CORe3 to handle situations with evolving objectives.
List of keywords
Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Incremental learning
Machine Learning -> ML: Optimization
1230
DCDet: Dynamic Cross-based 3D Object Detector
Shuai Liu, Boyang Li, Zhiyu Fang, Kai Huang
6 min. talk | August 6th at 15:00 | Session: CV: 3D computer vision (2/2)
Recently, significant progress has been made in 3D object detection research. However, most prior studies have focused on center-based or anchor-based label assignment schemes, and alternative label assignment strategies remain unexplored in 3D object detection. We find that center-based label assignment often fails to generate sufficient positive samples for training, while anchor-based label assignment tends to encounter an imbalance issue when handling objects of different scales. To solve these issues, we introduce a dynamic cross label assignment (DCLA) scheme, which dynamically assigns positive samples for each object from a cross-shaped region, thus providing sufficient and balanced positive samples for training. Furthermore, to address the challenge of accurately regressing objects of varying scales, we put forth a rotation-weighted intersection over union (RWIoU) metric to replace the widely used L1 metric in the regression loss. Extensive experiments demonstrate the generality and effectiveness of our DCLA scheme and RWIoU-based regression loss. The code is available at https://github.com/Say2L/DCDet.git.
List of keywords
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Recognition (object detection, categorization)
1231
Sparse Multi-Relational Graph Convolutional Network for Multi-type Object Trajectory Prediction
Jianhui Zhang, Jun Yao, Liqi Yan, Yanhong Xu, Zheng Wang
6 min. talk | August 6th at 11:30 | Session: CV: Video analysis and understanding   
Object trajectory prediction is an active research topic with wide applications in video surveillance and autonomous driving. Previous studies consider interaction sparsity mainly among pedestrians rather than among multiple types of objects, yet mixed object types bring new kinds of interactions and, consequently, superfluous ones. This paper proposes a Multi-type Object Trajectory Prediction (MOTP) method with a Sparse Multi-relational Graph Convolutional Network (SMGCN) and a novel multi-round Global Temporal Aggregation (GTA). MOTP introduces a novel adaptive sparsification and multi-scale division method to model interactions among multiple types of objects. It further incorporates a Sparse Multi-relational Temporal Graph to capture the temporal division of multi-type trajectories, along with a multi-round GTA mechanism that mitigates error accumulation and enhances trajectory prediction accuracy. Extensive evaluation on the ETH, UCY, and SDD datasets shows that our method outperforms typical state-of-the-art works by significant margins. Code will be available at https://github.com/sounio/SMGCN.
List of keywords
Computer Vision -> CV: Video analysis and understanding   
Computer Vision -> CV: Action and behavior recognition
1239
DenseKoopman: A Plug-and-Play Framework for Dense Pedestrian Trajectory Prediction
Xianbang Li, Yilong Ren, Han Jiang, Haiyang Yu, Yanlei Cui, Liang Xu
6 min. talk | August 8th at 10:00 | Session: CV: Motion and tracking
Pedestrian trajectory prediction has emerged as a core component of human-robot interaction and autonomous driving. Fast and accurate prediction of surrounding pedestrians aids decision making and improves safety and efficiency. However, pedestrians' future trajectories interact with those of surrounding traffic participants. As pedestrian density increases, the complexity of such interactions also increases significantly, leading to an inevitable decrease in trajectory prediction accuracy. To address this issue, we propose DenseKoopman, a plug-and-play framework for dense pedestrian trajectory prediction. Specifically, we introduce Koopman operator theory to find an embedding space in which a global linear approximation of the nonlinear pedestrian motion system is possible. By encoding historical trajectories as linear state embeddings in the Koopman space, we transform the nonlinear trajectory data of pedestrians in dense scenes. This linearized representation greatly reduces the complexity of dense pedestrian trajectory prediction. Extensive experiments on pedestrian trajectory prediction benchmarks demonstrate the superiority of the proposed framework. We also analyze the data transformation to explore how our DenseKoopman framework works with each validation method and to uncover motion patterns that may be hidden within the trajectory data. Code is available at https://github.com/lixianbang/DenseKoopman.
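The generic recipe behind Koopman-based predictors, sketched with a hand-built embedding; DenseKoopman additionally learns the embedding so that dense-crowd dynamics become (near-)linear in it:

    import numpy as np

    def fit_koopman_operator(Z):
        """Given embedded states z_1..z_T (rows of Z), fit the best linear
        operator K with z_{t+1} ~ K z_t by least squares (the DMD-style
        estimate K = Y X^+)."""
        X, Y = Z[:-1].T, Z[1:].T                  # shape (dim, T-1)
        return Y @ np.linalg.pinv(X)

    # toy check: a rotation is exactly linear, so K is recovered and
    # one-step forecasting is a single matrix-vector product
    theta = 0.1
    R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
    Z = np.stack([np.linalg.matrix_power(R, t) @ np.array([1.0, 0.0]) for t in range(50)])
    K = fit_koopman_operator(Z)
    print(np.allclose(K, R, atol=1e-8), K @ Z[-1])  # True, plus a one-step forecast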
List of keywords
Computer Vision -> CV: Motion and tracking
Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Other
1240
Boundary-aware Decoupled Flow Networks for Realistic Extreme Rescaling
Jinmin Li, Tao Dai, Jingyun Zhang, Kang Liu, Jun Wang, Shaoming Wang, Shu-Tao Xia, Rizen Guo
12 min. talk | August 8th at 15:00 | Session: CV: Image and video synthesis and generation (2/2)
Recently developed generative methods, including invertible rescaling network (IRN) based and generative adversarial network (GAN) based methods, have demonstrated exceptional performance in image rescaling. However, IRN-based methods tend to produce over-smoothed results, while GAN-based methods easily generate fake details, which hinders their practical application. To address this issue, we propose Boundary-aware Decoupled Flow Networks (BDFlow) to generate realistic and visually pleasing results. Unlike previous methods that model high-frequency information directly as a standard Gaussian distribution, our BDFlow first decouples the high-frequency information into a semantic part that adheres to a Boundary distribution and a non-semantic part that adheres to a Gaussian distribution. Specifically, to capture the semantic high-frequency part accurately, we use a Boundary-aware Mask (BAM) to constrain the model to produce rich textures, while the non-semantic high-frequency part is randomly sampled from a Gaussian distribution. Comprehensive experiments demonstrate that BDFlow significantly outperforms other state-of-the-art methods while maintaining lower complexity. Notably, BDFlow improves PSNR by 4.4 dB and SSIM by 0.1 on average over GRAIN, using only 74% of the parameters and 20% of the computation. The code will be available at https://github.com/THU-Kingmin/BAFlow.
List of keywords
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Applications
1256
PointTFA: Training-Free Clustering Adaption for Large 3D Point Cloud Models
Jinmeng Wu, Chong Cao, Hao Zhang, Basura Fernando, Yanbin Hao, Hanyu Hong
6 min. talk | August 6th at 15:00 | Session: CV: 3D computer vision (2/2)
The success of contrastive learning models like CLIP, known for aligning 2D image-text pairs, has inspired the development of triplet alignment for large 3D point cloud models (3D-PCM). Examples like ULIP integrate images, text, and point clouds into a unified semantic space. However, despite showing impressive zero-shot capabilities, a frozen 3D-PCM still falls short of fine-tuned methods, especially when downstream 3D datasets differ significantly from upstream data. To address this, we propose a data-efficient, training-free 3D adaptation method named PointTFA that adjusts ULIP outputs with representative samples. PointTFA comprises the Representative Memory Cache (RMC) for selecting a representative support set, Cloud Query Refactor (CQR) for reconstructing a query cloud using the support set, and the Training-Free 3D Adapter (3D-TFA) for inferring query categories from the support set. A key advantage of PointTFA is that it introduces no extra training parameters, yet outperforms vanilla frozen ULIP and closely approaches few-shot fine-tuning methods on downstream cloud classification tasks such as ModelNet10 & 40 and ScanObjectNN. The code is available at: https://github.com/CaoChong-git/PointTFA.
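A generic training-free cache classifier in the Tip-Adapter style, to show how inference without new trainable parameters can work; PointTFA builds its cache differently (RMC for the support set, CQR for query refactoring), so this is only a sketch of the shared idea:

    import numpy as np

    def cache_classify(query_feat, support_feats, support_labels, n_classes, beta=5.0):
        """Score a query by its similarity to a cached support set and aggregate
        one-hot labels; no gradient step is ever taken."""
        s = support_feats @ query_feat                     # cosine sims (features L2-normed)
        w = np.exp(beta * (s - 1.0))                       # sharpened affinities
        onehot = np.eye(n_classes)[support_labels]
        return (w[None, :] @ onehot).ravel()               # per-class evidence

    feats = np.random.randn(10, 16)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    q = feats[3] + 0.01 * np.random.randn(16)
    q /= np.linalg.norm(q)
    labels = np.arange(10) % 5
    print(cache_classify(q, feats, labels, n_classes=5).argmax())  # likely labels[3] == 3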
List of keywords
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Recognition (object detection, categorization)
1258
R2V-MIF: Rule-to-Vector Contrastive Learning and Multi-channel Information Fusion for Therapy Recommendation
Nengjun Zhu, Jieyun Huang, Jian Cao, Liang Hu, Zixuan Yuan, Huanjing Gao
6 min. talk | August 9th at 10:00 | Session: DM: Recommender systems
Integrating data-driven and rule-based approaches is crucial for therapy recommendation, since the two can collaborate to achieve better performance. Medical rules, i.e., chains of reasoning that can infer therapies, exist widely. However, their symbolic and logical forms make them hard to integrate with data-driven modeling technologies. A few attempts have indirectly modeled rules using the data that supports them, but the poor generalization of medical rules leads to inadequate supporting data and thus diminishes their benefit. To this end, we propose R2V-MIF, which fills this gap with rule-to-vector contrastive learning (R2V) and multi-channel information fusion (MIF). R2V is a data-free module that uses a hypergraph, with condition and result nodes, to instantiate the logic of medical rules. Each rule is reflected in the relations between nodes, and node representations are determined through contrastive learning. Taking rule representations as a bridge, MIF integrates knowledge from medical rules, similar neighbors, and patient contents, and then recommends therapies. Extensive experiments show that R2V-MIF outperforms the baselines on several metrics using real-world medical data. Our code is available at https://github.com/vgeek-z/r2vmif.
List of keywords
Data Mining -> DM: Recommender systems
Data Mining -> DM: Mining heterogenous data
Multidisciplinary Topics and Applications -> MTA: Health and medicine
1301
Constructive Interpolation and Concept-Based Beth Definability for Description Logics via Sequents
Timothy S. Lyon, Jonas Karge
6 min. talk | August 7th at 15:00 | Session: KRR: Knowledge Representation and Reasoning (1/2)
We introduce a constructive method applicable to a large number of description logics (DLs) for establishing the concept-based Beth definability property (CBP) based on sequent systems. Using the highly expressive DL RIQ as a case study, we introduce novel sequent calculi for RIQ-ontologies and show how certain interpolants can be computed from sequent calculus proofs, which permit the extraction of explicit definitions of implicitly definable concepts. To the best of our knowledge, this is the first sequent-based approach to computing interpolants and definitions within the context of DLs, as well as the first proof that RIQ enjoys the CBP. Moreover, due to the modularity of our sequent systems, our results hold for any restriction of RIQ, and are applicable to other DLs by suitable modifications.
List of keywords
Knowledge Representation and Reasoning -> KRR: Description logics and ontologies
Knowledge Representation and Reasoning -> KRR: Automated reasoning and theorem proving
1315
UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation
Qingdong He, Jinlong Peng, Zhengkai Jiang, Kai Wu, Xiaozhong Ji, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Mingang Chen, Yunsheng Wu
6 min. talk | August 6th at 15:00 | Session: CV: 3D computer vision (2/2)
3D open-vocabulary scene understanding aims to recognize arbitrary novel categories beyond the base label space. However, existing works not only fail to fully utilize all the available modal information in the 3D domain but also lack sufficient granularity in representing the features of each modality. In this paper, we propose UniM-OV3D, a unified multimodal 3D open-vocabulary scene understanding network that aligns point clouds with image, language, and depth. To better integrate global and local features of the point clouds, we design a hierarchical point cloud feature extraction module that learns fine-grained feature representations. Further, to facilitate the learning of coarse-to-fine point-semantic representations from captions, we propose the use of hierarchical 3D caption pairs, capitalizing on geometric constraints across various viewpoints of 3D scenes. Extensive experimental results demonstrate the effectiveness and superiority of our method in open-vocabulary semantic and instance segmentation, achieving state-of-the-art performance on both indoor and outdoor benchmarks such as ScanNet, ScanNet200, S3DIS, and nuScenes. Code is available at https://github.com/hithqd/UniM-OV3D.
List of keywords
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Applications
Computer Vision -> CV: Scene analysis and understanding   
1323
KTCN: Enhancing Open-World Object Detection with Knowledge Transfer and Class-Awareness Neutralization
Xing Xi, Yangyang Huang, Jinhao Lin, Ronghua Luo
6 min. talk | August 9th at 11:30 | Session: CV: Computer Vision (1/2)
[+] More
[-] Less
Open-world object detection (OWOD) has garnered widespread attention due to its ability to recall unannotated objects. Existing works generate pseudo-labels for the model using heuristic priors, which limits performance. In this paper, we leverage the knowledge of a large-scale visual model to provide supervision for unknown categories. Specifically, we use the Segment Anything Model (SAM) to generate raw pseudo-labels for potential objects and refine them through intersection over union (IoU) and the shortest bounding-box side length. Nevertheless, the abundance of pseudo-labels still exacerbates the competition issue in one-to-many label assignment. To address this, we propose the Dual Matching Label Assignment (DMLA) strategy. Furthermore, we propose the Class-Awareness Neutralizer (CAN) to reduce the model's bias towards known categories. Evaluation results on open-world object detection benchmarks, including MS COCO and Pascal VOC, show that our method achieves nearly twice the unknown recall rate of previous state-of-the-art (SOTA) methods, reaching 41.5 U-Recall. Additionally, our approach adds no extra parameters, maintaining the inference-speed advantage of Faster R-CNN and leading SOTA methods based on deformable DETR while running at over 10 FPS. Our code is available at https://github.com/xxyzll/KTCN.
List of keywords
Computer Vision -> CV: Recognition (object detection, categorization)
1351
Ansatz-Agnostic Exponential Resource Saving in Variational Quantum Algorithms Using Shallow Shadows
Afrad Basheer, Yuan Feng, Christopher Ferrie, Sanjiang Li
6 min. talk | August 6th at 15:00 | Session: ML: Machine Learning (2/6)
Variational Quantum Algorithms (VQA) have been identified as a promising candidate for the demonstration of near-term quantum advantage in solving optimization tasks in chemical simulation, quantum information, and machine learning. The standard model of training requires a significant amount of quantum resources, which led researchers to use classical shadows to devise an alternative that consumes exponentially fewer quantum resources. However, the approach only works when the observables are local and the ansatz is the shallow Alternating Layered Ansatz (ALA), thus severely limiting its potential in solving problems such as quantum state preparation, where the ideal state might not be approximable with an ALA. In this work, we present a protocol based on shallow shadows that achieves similar levels of savings for almost any shallow ansatz studied in the literature, when combined with observables of low Frobenius norm. We show that two important applications in quantum information for which VQAs can be a powerful option, namely variational quantum state preparation and variational quantum circuit synthesis, are compatible with our protocol. We also experimentally demonstrate orders of magnitude improvement in comparison to the standard VQA model.
List of keywords
Machine Learning -> ML: Other
Machine Learning -> ML: Optimization
1358
Span-based Unified Named Entity Recognition Framework via Contrastive Learning
Hongli Mao, Xian-Ling Mao, Hanlin Tang, Yu-Ming Shang, Xiaoyan Gao, Ao-Jie Ma, Heyan Huang
6 min. talk | August 6th at 15:00 | Session: NLP: Natural Language Processing (1/3)
Traditional named entity recognition (NER) models are typically designed for domain-specific datasets and are limited to fixed predefined types, making it difficult to generalize to new domains. Recently, prompt-based generative methods attempt to mitigate this constraint by training models jointly on diverse datasets and extracting specified entities via prompt instructions. However, due to their autoregressive structure, these methods cannot directly model entity spans and suffer from slow sequential decoding. To address these issues, we propose SUNER, a novel span-based unified NER framework via contrastive learning, which aligns text-span and entity-type representations in a shared semantic space to extract entities in parallel. Specifically, we first extract mention spans without considering entity types to better generalize across datasets. Then, leveraging contrastive learning and a well-designed entity-marker structure, we map candidate spans and their textual type descriptions into the same vector space to differentiate entities across domains. Extensive experiments in both supervised and zero/few-shot settings demonstrate that the proposed SUNER model achieves better performance and higher efficiency than previous state-of-the-art unified NER models.
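The core scoring step such a span-based framework relies on, sketched with stand-in vectors for the encoder outputs; SUNER's entity-marker structure and training loss are omitted:

    import numpy as np

    def rank_types(span_vec, type_vecs, temperature=0.07):
        """Embed a candidate mention span and all textual type descriptions
        into one space, then softmax over cosine similarities; contrastive
        training would pull a span toward its true type's description."""
        sims = type_vecs @ span_vec / (
            np.linalg.norm(type_vecs, axis=1) * np.linalg.norm(span_vec))
        logits = sims / temperature
        p = np.exp(logits - logits.max())
        return p / p.sum()

    span = np.array([0.9, 0.1, 0.0])
    types = np.array([[1.0, 0.0, 0.0],    # e.g., "person"
                      [0.0, 1.0, 0.0],    # e.g., "location"
                      [0.0, 0.0, 1.0]])   # e.g., "organization"
    print(rank_types(span, types))        # mass concentrates on the first type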
List of keywords
Natural Language Processing -> NLP: Named entities
Natural Language Processing -> NLP: Information extraction
1366
Hundredfold Accelerating for Pathological Images Diagnosis and Prognosis through Self-reform Critical Region Focusing
Xiaotian Yu, Haoming Luo, Jiacong Hu, Xiuming Zhang, Yuexuan Wang, Wenjie Liang, Yijun Bei, Mingli Song, Zunlei Feng
12 min. talk | August 9th at 10:00 | Session: CV: Biomedical image analysis
Pathological slides are commonly gigapixel images with abundant information and are therefore significant for clinical diagnosis. However, their ultra-large size makes both training and evaluation extremely time-consuming, and most existing methods must crop each slide into patches, which also leads to large memory requirements. In this paper, we propose the Self-reform Multilayer Transformer (SMT) to accelerate pathological image diagnosis and prognosis. Inspired by pathologists' diagnostic procedure, SMT is designed to focus layer by layer on critical regions. In the forward process, the first layer takes thumbnails as input and measures the significance of each patch that deserves focus. Images from the focused regions are cropped at higher magnification and used as input to the next layer. By analogy, the inputs of the third layer are the focused images of the second layer, which contain abundant cellular features. In addition to forward focusing, a backward reform strategy is proposed to improve the precision of earlier layers. This cyclic process yields iterative interactions for better performance on both classification and focusing. In this way, SMT requires only a small fraction of critical patches for diagnosis and prognosis. Extensive experiments demonstrate that SMT is hundreds of times faster than existing SOTA methods while achieving comparable accuracy and requiring less storage.
List of keywords
Computer Vision -> CV: Biomedical image analysis
Computer Vision -> CV: Recognition (object detection, categorization)
Machine Learning -> ML: Classification
1367
Optimal Graph Learning and Nuclear Norm Maximization for Deep Cross-Domain Robust Label Propagation
Wei Wang, Hanyang Li, Ke Shi, Chao Huang, Yang Cao, Cong Wang, Xiaochun Cao
6 min. talk | August 6th at 15:00 | Session: CV: Transfer, low-shot, semi- and un- supervised learning   
Domain adaptation aims to achieve label transfer from a labeled source domain to an unlabeled target domain, where the two domains exhibit different distributions. Existing methods primarily concentrate on designing a feature extractor to learn better domain-invariant features, along with developing an effective classifier for reliable predictions. In this paper, we introduce optimal graph learning to generate a cross-domain graph that effectively connects the two domains, and two domain-specific graphs to capture domain-specific structures. On the one hand, we incorporate the three graphs into the label propagation (LP) classifier to enhance its robustness to distribution differences. On the other hand, we leverage the three graphs to introduce graph embedding losses, promoting the learning of locally discriminative and domain-invariant features. Furthermore, we maximize the nuclear norm of predictions in LP to enhance class diversity, thereby improving its robustness to the class imbalance problem. Correspondingly, we develop an efficient algorithm to solve the associated optimization problem. Finally, we integrate the proposed LP and graph embedding losses into a deep neural network, resulting in our proposed deep cross-domain robust LP. Extensive experiments conducted on three cross-domain benchmark datasets demonstrate that our proposed approach outperforms existing state-of-the-art domain adaptation methods.
List of keywords
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
Machine Learning -> ML: Classification
Machine Learning -> ML: Feature extraction, selection and dimensionality reduction
1377
DifTraj: Diffusion Inspired by Intrinsic Intention and Extrinsic Interaction for Multi-Modal Trajectory Prediction
Yanghong Liu, Xingping Dong, Yutian Lin, Mang Ye
6 min. talk | August 8th at 10:00 | Session: CV: Motion and tracking
Recent years have witnessed the success of generative adversarial networks and diffusion models in multi-modal trajectory prediction. However, prevailing algorithms only explicitly consider human interaction while ignoring the modeling of human intention, so the generated results deviate substantially from real trajectories in some complex scenes. In this paper, we analyze the conditions of multi-modal trajectory prediction from two objective perspectives and propose a novel end-to-end framework based on the diffusion model to predict more precise and socially acceptable trajectories for humans. First, a spatial-temporal aggregation module is built to extract extrinsic interaction features for capturing socially acceptable behaviors. Second, we explicitly construct an intrinsic intention module to obtain intention features for precise prediction. Finally, we estimate a noise trajectory distribution with these two features as the initialization of the diffusion model and leverage the denoising process to obtain the final trajectories. Furthermore, to reduce the noise of the initial trajectory estimation, we present a novel sample consistency loss to constrain multiple predictions. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on the ETH-UCY and SDD benchmarks, achieving in particular a 19.0%/24.2% ADE/FDE improvement on ETH-UCY.
List of keywords
Computer Vision -> CV: Motion and tracking
Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
Machine Learning -> ML: Time series and data streams
1383
ClothPPO: A Proximal Policy Optimization Enhancing Framework for Robotic Cloth Manipulation with Observation-Aligned Action Spaces
Libing Yang, Yang Li, Long Chen
6 min. talk | August 6th at 11:30 | Session: ROB: Robotics (1/2)
Vision-based robotic cloth unfolding has made great progress recently. However, prior works predominantly rely on value learning and have not fully explored policy-based techniques. Recently, the success of reinforcement learning on large language models has shown that policy gradient algorithms can enhance policies over huge action spaces. In this paper, we introduce ClothPPO, a framework that employs a policy gradient algorithm based on an actor-critic architecture to enhance a pre-trained model with a huge, observation-aligned action space of 10^6 actions in the task of unfolding clothes. To this end, we redefine the cloth manipulation problem as a partially observable Markov decision process. A supervised pre-training stage is employed to train a baseline model of our policy. In the second stage, Proximal Policy Optimization (PPO) is utilized to guide the supervised model within the observation-aligned action space. By optimizing and updating the strategy, our proposed method increases the garment’s surface area for cloth unfolding under the soft-body manipulation task. Experimental results show that our proposed framework can further improve the unfolding performance of other state-of-the-art methods. Our project is available at https://vpx-ecnu.github.io/ClothPPO-website/.
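The second training stage relies on standard PPO. Below is a minimal sketch of the clipped surrogate loss that would guide the supervised policy; this is generic PPO, not ClothPPO's full pipeline, and the toy batch is ours.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    # Probability ratio between the updated policy and the behavior policy.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    # Clipping keeps each update close to the pre-trained policy.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# Toy batch: log-probabilities of sampled actions plus their advantages.
loss = ppo_clip_loss(torch.randn(32), torch.randn(32), torch.randn(32))
```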
List of keywords
Robotics -> ROB: Learning in robotics
Robotics -> ROB: Manipulation
Robotics -> ROB: Perception
Machine Learning -> ML: Reinforcement learning
1385
Rethinking Centered Kernel Alignment in Knowledge Distillation
Zikai Zhou, Yunhang Shen, Shitong Shao, Linrui Gong, Shaohui Lin
6 min. talk | August 7th at 11:30 | Session: ML: Deep learning architectures
Knowledge distillation has emerged as a highly effective method for bridging the representation discrepancy between large-scale models and lightweight models. Prevalent approaches involve leveraging appropriate metrics to minimize the divergence or distance between the knowledge extracted from the teacher model and the knowledge learned by the student model. Centered Kernel Alignment (CKA) is widely used to measure representation similarity and has been applied in several knowledge distillation methods. However, these methods are complex and fail to uncover the essence of CKA, leaving open the question of how to use CKA properly to achieve simple and effective distillation. This paper first provides a theoretical perspective to illustrate the effectiveness of CKA, decoupling CKA into an upper bound on the Maximum Mean Discrepancy (MMD) plus a constant term. Drawing from this, we propose a novel Relation-Centered Kernel Alignment (RCKA) framework, which practically establishes a connection between CKA and MMD. Furthermore, we dynamically customize the application of CKA based on the characteristics of each task, requiring fewer computational resources yet delivering performance comparable to previous methods. Extensive experiments on CIFAR-100, ImageNet-1k, and MS-COCO demonstrate that our method achieves state-of-the-art performance on almost all teacher-student pairs for image classification and object detection, validating the effectiveness of our approach. Our code is available at https://github.com/Klayand/PCKA.
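For readers unfamiliar with the measure, here is a minimal NumPy sketch of standard linear CKA between two representation matrices over the same examples. This is the textbook measure the abstract builds on, not the paper's RCKA variant.

```python
import numpy as np

def linear_cka(X, Y):
    # Rows are the same n examples; columns are (possibly different) features.
    X = X - X.mean(axis=0, keepdims=True)   # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style formulation specialized to linear kernels.
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

Z = np.random.randn(128, 64)
print(linear_cka(Z, Z))                          # identical features -> 1.0
print(linear_cka(Z, np.random.randn(128, 32)))   # unrelated features -> near 0
```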
List of keywords
Machine Learning -> ML: Deep learning architectures
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Representation learning
Machine Learning -> ML: Representation learning
1399
Massively Parallel Single-Source SimRanks in o(log n) Rounds
Siqiang Luo, Zulun Zhu
6 min. talk | August 8th at 15:00 | Session: DM: Data Mining (2/2)
SimRank is one of the most fundamental measures that evaluate the structural similarity between two nodes in a graph and has been applied in a plethora of data mining and machine learning tasks. These tasks often involve single-source SimRank computation, which evaluates the SimRank values between a source node u and all other nodes. Due to its high computational complexity, single-source SimRank computation for large graphs is notoriously challenging, and hence recent studies resort to distributed processing. To our surprise, although SimRank has been widely adopted for two decades, theoretical aspects of distributed SimRank computation with provable results have rarely been studied. In this paper, we conduct a theoretical study on single-source SimRank computation in the Massive Parallel Computation (MPC) model, which is the standard theoretical framework modeling distributed systems. Existing distributed SimRank algorithms enforce either Ω(log n) communication round complexity or Ω(n) machine space for a graph of n nodes. We overcome this barrier. Particularly, given a graph of n nodes, for any query node v and constant error ϵ>3/n, we show that O(log² log n) rounds of communication among machines suffice to compute single-source SimRank values with at most ϵ absolute error, while each machine only needs space sub-linear in n. To the best of our knowledge, this is the first single-source SimRank algorithm in MPC that can overcome the Θ(log n) round-complexity barrier with provable result accuracy.
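For context, the sequential textbook definition that the MPC algorithm parallelizes: the SimRank of two distinct nodes is a decayed average of the SimRank over all pairs of their in-neighbors. A naive iterative sketch follows (illustrative only; the paper's contribution is the sub-logarithmic-round distributed computation, not this O(n²)-per-iteration loop).

```python
import numpy as np

def simrank(adj_in, c=0.6, iters=10):
    """Naive iterative SimRank; adj_in[v] lists the in-neighbors of node v."""
    n = len(adj_in)
    s = np.eye(n)
    for _ in range(iters):
        new = np.eye(n)
        for a in range(n):
            for b in range(n):
                if a == b or not adj_in[a] or not adj_in[b]:
                    continue
                # Decayed average similarity over all in-neighbor pairs.
                total = sum(s[i][j] for i in adj_in[a] for j in adj_in[b])
                new[a][b] = c * total / (len(adj_in[a]) * len(adj_in[b]))
        s = new
    return s

# Single-source query: SimRank of node 0 versus all other nodes.
adj_in = {0: [2], 1: [2], 2: [0, 1], 3: [2]}
print(simrank(adj_in)[0])
```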
List of keywords
Data Mining -> DM: Parallel, distributed and cloud-based high performance mining
Data Mining -> DM: Theoretical foundations of data mining
1423
Truth Table Net: Scalable, Compact & Verifiable Neural Networks with a Dual Convolutional Small Boolean Circuit Networks Form
Adrien Benamira, Thomas Peyrin, Trevor Yap, Tristan Guérand, Bryan Hooi
12 min. talk | August 8th at 11:30 | Session: MAS: Formal verification, validation and synthesis
We introduce "Truth Table net"’ (TTnet), a novel Deep Neural Network (DNN) architecture designed to provide excellent scalability/compactness trade-offs among DNNs, allowing in turn to tackle the DNN challenge of fast formal verification. TTnet is constructed using Learning Truth Table (LTT) filters, analogous to how a Deep Convolutional Neural Network (DCNN) is built upon convolutional filters. The differentiable LTT filters are unique by their dual form: they are both a neural network-based function and a small-sized truth table that can be computed within a practical time frame. This characteristic guarantees, by design and independently of the overall architecture, the ability to practically extract an efficient (in terms of the number of logical gates) and functionally equivalent Conjunctive Normal Form (CNF) Boolean logic gate implementation. This CNF circuit is even optimal when the LTT truth table’s input bit size n < 12. In particular, TTnet architecture is the first differentiable DNN with as dual form a compact logic gate representation that can scale to datasets larger than CIFAR-10: we achieve an accuracy of 41% on the ImageNet dataset while ensuring that each LTT filter truth table is fully computable within 2^{16} operations. We further compare the compactness and scalability performances of TTnet Boolean logic circuit representation to state-of-the-art differentiable logic DNNs across tabular, MNIST, and CIFAR-10 datasets. We emphasize that TTnet is the first solution to the open problem of designing differentiable convolutional neural networks with an exact dual logic gate circuit representation, bridging the gap between symbolic AI and trainable DCNNs. Finally, as improving DNNs compactness in Boolean logic circuit form reduces the complexity of their formal verification, we demonstrate TTnet effectiveness in exact sound and complete formal verification. Notably, our model achieves robustness verification in 10ms vs 100s for traditional state-of-the-art DNNs solvers.
List of keywords
Agent-based and Multi-agent Systems -> MAS: Formal verification, validation and synthesis
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
Constraint Satisfaction and Optimization -> CSO: Satisfiabilty
Machine Learning -> ML: Convolutional networks
1425
SemanticMask: A Contrastive View Design for Anomaly Detection in Tabular Data
Shuting Tao, Tongtian Zhu, Hongwei Wang, Xiangming Meng
6 min. talk | August 8th at 11:30 | Session: DM: Anomaly/outlier detection
Contrastive learning based on data augmentation techniques has recently achieved substantial advances in learning representations well-suited for anomaly detection in the image domain. However, due to the lack of spatial structure, designing effective data augmentation methods for tabular data remains challenging. Conventional techniques, such as random masking, disregard inter-feature correlations and fail to accurately represent the data. To address this issue, we propose a novel augmentation technique called SemanticMask, which leverages the semantic information in column names to generate better augmented views. SemanticMask aims to ensure that the information shared between views is sufficient for anomaly detection without redundancy. We analyze the relationship between shared information and anomaly detection performance and empirically demonstrate that good views for tabular anomaly detection tasks are feature-dependent. Our experimental results validate the superiority of SemanticMask over state-of-the-art anomaly detection methods and existing augmentation techniques for tabular data. In further evaluations on the multi-class novelty detection task, SemanticMask also significantly outperforms the baseline.
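A rough sketch of the core idea, masking whole groups of semantically related columns to form contrastive views: the real method derives the groups from column-name semantics, whereas here the grouping is supplied by hand, and all names are ours.

```python
import numpy as np

def semantic_mask(x, column_groups, rng, mask_ratio=0.5):
    """Generate two views of a tabular sample by masking whole groups of
    semantically related columns together (groups assumed to come from
    column-name semantics; here they are supplied directly)."""
    views = []
    for _ in range(2):
        keep = [g for g in column_groups if rng.random() > mask_ratio]
        view = np.zeros_like(x)
        for g in keep:
            view[g] = x[g]   # unmask the whole correlated group at once
        views.append(view)
    return views

rng = np.random.default_rng(0)
x = rng.normal(size=8)
# Hypothetical grouping, e.g. {blood-pressure cols}, {lab-result cols}, ...
groups = [[0, 1], [2, 3, 4], [5], [6, 7]]
view1, view2 = semantic_mask(x, groups, rng)
```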
List of keywords
Data Mining -> DM: Anomaly/outlier detection
Machine Learning -> ML: Unsupervised learning
1427
Proximal Curriculum with Task Correlations for Deep Reinforcement Learning
Georgios Tzannetos, Parameswaran Kamalaruban, Adish Singla
12 min. talk | August 8th at 15:00 | Session: ML: Reinforcement learning (2/2)
Curriculum design for reinforcement learning (RL) can speed up an agent’s learning process and help it learn to perform well on complex tasks. However, existing techniques typically require domain-specific hyperparameter tuning, involve expensive optimization procedures for task selection, or are suitable only for specific learning objectives. In this work, we consider curriculum design in contextual multi-task settings where the agent’s final performance is measured w.r.t. a target distribution over complex tasks. We base our curriculum design on the Zone of Proximal Development concept, which has proven effective in accelerating the learning process of RL agents for a uniform distribution over all tasks. We propose a novel curriculum, ProCuRL-Target, that effectively balances the need for selecting tasks that are not too difficult for the agent while progressing the agent’s learning toward the target distribution via leveraging task correlations. We theoretically justify the task selection strategy of ProCuRL-Target by analyzing a simple learning setting with a REINFORCE learner model. Our experimental results across various domains with challenging target task distributions affirm the effectiveness of our curriculum strategy over state-of-the-art baselines in accelerating the training process of deep RL agents.
List of keywords
Machine Learning -> ML: Reinforcement learning
Planning and Scheduling -> PS: Markov decisions processes
1455
A Meta-Game Evaluation Framework for Deep Multiagent Reinforcement Learning
Zun Li, Michael P. Wellman
6 min. talk | August 7th at 15:00 | Session: MAS: Multi-agent learning
Evaluating deep multiagent reinforcement learning (MARL) algorithms is complicated by stochasticity in training and sensitivity of agent performance to the behavior of other agents. We propose a meta-game evaluation framework for deep MARL, by framing each MARL algorithm as a meta-strategy, and repeatedly sampling normal-form empirical games over combinations of meta-strategies resulting from different random seeds. Each empirical game captures both self-play and cross-play factors across seeds. These empirical games provide the basis for constructing a sampling distribution, using bootstrapping, over a variety of game analysis statistics. We use this approach to evaluate state-of-the-art deep MARL algorithms on a class of negotiation games. From statistics on individual payoffs, social welfare, and empirical best-response graphs, we uncover strategic relationships among self-play, population-based, model-free, and model-based MARL methods. We also investigate the effect of run-time search as a meta-strategy operator, and find via meta-game analysis that the search version of a meta-strategy generally leads to improved performance.
List of keywords
Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
Game Theory and Economic Paradigms -> GTEP: Noncooperative games
1461
Unified Physical-Digital Face Attack Detection
Hao Fang, Ajian Liu, Haocheng Yuan, Junze Zheng, Dingheng Zeng, Yanhong Liu, Jiankang Deng, Sergio Escalera, Xiaoming Liu, Jun Wan, Zhen Lei
6 min. talk | August 7th at 11:30 | Session: CV: Biometrics, face, gesture and pose recognition
Face Recognition (FR) systems can suffer from physical (i.e., print photo) and digital (i.e., DeepFake) attacks. However, previous related work rarely considers both situations at the same time, which implies deploying multiple models and thus a greater computational burden. This lack of an integrated model stems from two factors: (1) the lack of a dataset that includes both physical and digital attacks in which the same ID covers the real face and all attack types; (2) given the large intra-class variance between these two kinds of attacks, it is difficult to learn a compact feature space that detects both attacks simultaneously. To address these issues, we collect a Unified physical-digital Attack dataset, called UniAttackData. The dataset covers 1,800 participants, each with 2 physical and 12 digital attacks, resulting in a total of 28,706 videos. Then, we propose a Unified Attack Detection framework based on Vision-Language Models (VLMs), namely UniAttackDetection, which includes three main modules: the Teacher-Student Prompts (TSP) module, focused on acquiring unified and specific knowledge respectively; the Unified Knowledge Mining (UKM) module, designed to capture a comprehensive feature space; and the Sample-Level Prompt Interaction (SLPI) module, aimed at grasping sample-level semantics. These three modules seamlessly form a robust unified attack detection framework. Extensive experiments on UniAttackData and three other datasets demonstrate the superiority of our approach for unified face attack detection. Dataset link: https://sites.google.com/view/face-anti-spoofing-challenge/dataset-download/uniattackdatacvpr2024
List of keywords
Computer Vision -> CV: Biometrics, face, gesture and pose recognition
Machine Learning -> ML: Multi-modal learning
1500
Memorizing Documents with Guidance in Large Language Models
Bumjin Park, Jaesik Choi
6 min. talk | August 9th at 11:30 | Session: NLP: Language models
Training data plays a pivotal role in AI models. Large language models (LLMs) are trained on massive amounts of documents, and their parameters hold document-related content. Several recent studies identify content-specific locations in LLMs by examining the parameters. Instead of such post hoc interpretation, we propose another approach: a document-wise memory architecture that tracks document memories during training. The proposed architecture maps document representations to memory entries, which softly mask memories in the forward process of LLMs. Additionally, we propose a document guidance loss, which increases the likelihood of text given its own document’s memories and reduces the likelihood of that text given the memories of other documents. Experimental results on Wikitext-103-v1 with Pythia-1B show that the proposed methods provide distinct memory entries for documents and high recall of document-related content in generation with trained document-wise memories.
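One way to picture the document-wise memory is a routing layer that softly masks rows of a shared memory table from a document representation and injects the result into the model's hidden states. The sketch below uses hypothetical shapes and module names and is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DocMemory(nn.Module):
    """Sketch: a document representation softly selects memory entries
    that are added into the forward pass (hypothetical shapes)."""
    def __init__(self, doc_dim, n_entries, mem_dim):
        super().__init__()
        self.router = nn.Linear(doc_dim, n_entries)
        self.memory = nn.Parameter(torch.randn(n_entries, mem_dim) * 0.02)

    def forward(self, doc_repr, hidden):
        gate = torch.sigmoid(self.router(doc_repr))   # soft mask over entries
        mem = gate @ self.memory                      # (batch, mem_dim)
        return hidden + mem.unsqueeze(1)              # inject into hidden states

mem = DocMemory(doc_dim=64, n_entries=128, mem_dim=512)
out = mem(torch.randn(2, 64), torch.randn(2, 10, 512))
```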
List of keywords
Natural Language Processing -> NLP: Language models
Natural Language Processing -> NLP: Embeddings
Natural Language Processing -> NLP: Interpretability and analysis of models for NLP
Natural Language Processing -> NLP: Language generation
1505
Meta In-Context Learning Makes Large Language Models Better Zero and Few-Shot Relation Extractors
Guozheng Li, Peng Wang, Jiajun Liu, Yikai Guo, Ke Ji, Ziyu Shang, Zijie Xu
6 min. talk | August 7th at 11:30 | Session: NLP: Information extraction
Relation extraction (RE) is an important task that aims to identify the relationships between entities in texts. While large language models (LLMs) have revealed remarkable in-context learning (ICL) capability for general zero- and few-shot learning, recent studies indicate that current LLMs still struggle with zero- and few-shot RE. Previous studies are mainly dedicated to designing prompt formats and selecting good examples for improving ICL-based RE. Although both factors are vital for ICL, if one could fundamentally boost the ICL capability of LLMs in RE, zero- and few-shot RE performance via ICL would improve significantly. To this end, we introduce Micre (Meta In-Context learning of LLMs for Relation Extraction), a new meta-training framework for zero- and few-shot RE in which an LLM is tuned to do ICL on a diverse collection of RE datasets (i.e., learning to learn in context for RE). Through meta-training, the model learns a new RE task in context more effectively, conditioning on a few training examples with no parameter updates or task-specific templates at inference time, which enables better zero- and few-shot task generalization. We apply Micre to various LLMs of different model scales and 12 public RE datasets, and then evaluate it on unseen RE benchmarks under zero- and few-shot settings. Micre delivers performance comparable or superior to a range of baselines, including supervised fine-tuning and typical in-context learning methods. We find that the gains are particularly significant for larger model scales, and that using a diverse set of meta-training RE datasets is key to the improvements. Empirically, we show that Micre can transfer relation semantic knowledge via relation label names during inference on target RE datasets.
List of keywords
Natural Language Processing -> NLP: Information extraction
1525
Efficient and Stable Offline-to-online Reinforcement Learning via Continual Policy Revitalization
Rui Kong, Chenyang Wu, Chen-Xiao Gao, Zongzhang Zhang, Ming Li
6 min. talk | August 7th at 15:00 | Session: ML: Reinforcement learning (1/2)
In offline Reinforcement Learning (RL), pre-trained policies are utilized for initialization and subsequent online fine-tuning. However, existing methods suffer from instability and low sample efficiency compared to pure online learning. This paper identifies that these limitations stem from directly initializing the policy with offline-trained policy models. We propose Continual Policy Revitalization (CPR) as a novel, efficient, and stable fine-tuning method. CPR incorporates a periodic policy revitalization technique, restoring the overtrained policy network to full learning capacity while ensuring stable initial performance. This approach enables fine-tuning without being adversely affected by low-quality pre-trained policies. In contrast to previous research, CPR initializes the new policy with an adaptive policy constraint in policy optimization. Such optimization keeps the new policy close to a behavior policy constructed from historical policies. This contributes to stable policy improvement and optimal converged performance. Practically, CPR can seamlessly integrate into existing offline RL algorithms with minimal modification. We empirically validate the effectiveness of our method through extensive experiments, demonstrating substantial improvements in learning stability and efficiency compared to previous approaches. Our code is available at https://github.com/LAMDA-RL/CPR.
List of keywords
Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Offline reinforcement learning
1531
Optimal Auction Design with User Coupons in Advertising Systems
Xiaodong Liu, Zhikang Fan, Yiming Ding, Yuan Guo, Lihua Zhang, Changcheng Li, Dongying Kong, Han Li, Weiran Shen
6 min. talk | August 8th at 15:00 | Session: GTEP: Game Theory and Economic Paradigms
Online advertising is a major revenue source for most Internet companies. The advertising opportunities are usually sold to advertisers through auctions that take into account the bids of the advertisers and the click-through rates (CTRs) and conversion rates (CVRs) of the users. Standard auction design theory treats both the CTRs and the CVRs as constants. We consider a new auction mechanism that offers coupons to users when displaying the ads. Such coupons allow the user to buy the advertisers’ products or services at a lower price, which increases both the CTRs and the CVRs of the ads. In this paper, we formulate the problem mathematically and perform a systematic analysis. We characterize the set of individually rational and incentive compatible mechanisms in our setting. Based on the characterization, we identify the optimal strategy of offering coupons that maximizes the platform’s expected revenue. We also conduct extensive experiments on both synthetic data and industrial data. Our experiment results show that our mechanism significantly improves both the revenue and welfare of the platform, thereby creating a win-win situation for all parties, including the platform, the advertisers, and the users.
List of keywords
Game Theory and Economic Paradigms -> GTEP: Auctions and market-based systems
Game Theory and Economic Paradigms -> GTEP: Mechanism design
Game Theory and Economic Paradigms -> GTEP: Noncooperative games
1532
FedPFT: Federated Proxy Fine-Tuning of Foundation Models
Zhaopeng Peng, Xiaoliang Fan, Yufan Chen, Zheng Wang, Shirui Pan, Chenglu Wen, Ruisheng Zhang, Cheng Wang
6 min. talk | August 6th at 15:00 | Session: ML: Federated learning (1/2)
Adapting Foundation Models (FMs) for downstream tasks through Federated Learning (FL) emerges as a promising strategy for protecting data privacy and valuable FMs. Existing methods fine-tune an FM by allocating a sub-FM to each client in FL; however, this leads to suboptimal performance due to insufficient tuning and the inevitable accumulation of gradient errors. In this paper, we propose Federated Proxy Fine-Tuning (FedPFT), a novel method enhancing FM adaptation to downstream tasks through FL via two key modules. First, the sub-FM construction module employs a layer-wise compression approach, facilitating comprehensive FM fine-tuning across all layers by emphasizing crucial neurons. Second, the sub-FM alignment module conducts two-step distillation (layer-level and neuron-level) before and during FL fine-tuning, respectively, to reduce gradient error by accurately aligning the sub-FM with the FM under theoretical guarantees. Experimental results on seven commonly used datasets (i.e., four text and three vision) demonstrate the superiority of FedPFT. Our code is available at https://github.com/pzp-dzd/FedPFT.
List of keywords
Machine Learning -> ML: Federated learning
Machine Learning -> ML: Trustworthy machine learning
Multidisciplinary Topics and Applications -> MTA: Security and privacy
1540
Unsupervised Anomaly Detection via Masked Diffusion Posterior Sampling
Di Wu, Shicai Fan, Xue Zhou, Li Yu, Yuzhong Deng, Jianxiao Zou, Baihong Lin
6 min. talk | August 8th at 11:30 | Session: DM: Anomaly/outlier detection
Reconstruction-based methods have been commonly used for unsupervised anomaly detection, in which a normal image is reconstructed and compared with the given test image to detect and locate anomalies. Recently, diffusion models have shown promising applications for anomaly detection due to their powerful generative ability. However, these models lack strict mathematical support for normal image reconstruction and unexpectedly suffer from low reconstruction quality. To address these issues, this paper proposes a novel and highly interpretable method named Masked Diffusion Posterior Sampling (MDPS). In MDPS, the problem of normal image reconstruction is mathematically modeled as multiple diffusion posterior samplings of normal images, based on the devised masked noisy observation model and the diffusion-based normal image prior, under a Bayesian framework. Using a metric designed from pixel-level and perceptual-level perspectives, MDPS can effectively compute the difference map between each normal posterior sample and the given test image. Anomaly scores are obtained by averaging all difference maps over multiple posterior samples. Exhaustive experiments on the MVTec and BTAD datasets demonstrate that MDPS achieves state-of-the-art performance in normal image reconstruction quality as well as anomaly detection and localization.
List of keywords
Data Mining -> DM: Anomaly/outlier detection
Computer Vision -> CV: Applications
Computer Vision -> CV: Image and video synthesis and generation 
1542
Beyond Alignment: Blind Video Face Restoration via Parsing-Guided Temporal-Coherent Transformer
Kepeng Xu, Li Xu, Gang He, Wenxin Yu, Yunsong Li
6 min. talk | August 8th at 15:00 | Session: CV: Computer Vision (2/2)
Multiple complex degradations are coupled in low-quality video faces in the real world. Therefore, blind video face restoration is a highly challenging ill-posed problem, requiring not only hallucinating high-fidelity details but also enhancing temporal coherence across diverse pose variations. Restoring each frame independently in a naive manner inevitably introduces temporal incoherence and artifacts from pose changes and keypoint localization errors. To address this, we propose the first blind video face restoration approach with a novel parsing-guided temporal-coherent transformer (PGTFormer) without pre-alignment. PGTFormer leverages semantic parsing guidance to select optimal face priors for generating temporally coherent artifact-free results. Specifically, we pre-train a temporal-spatial vector quantized auto-encoder on high-quality video face datasets to extract expressive context-rich priors. Then, the temporal parse-guided codebook predictor (TPCP) restores faces in different poses based on face parsing context cues without performing face pre-alignment. This strategy reduces artifacts and mitigates jitter caused by cumulative errors from face pre-alignment. Finally, the temporal fidelity regulator (TFR) enhances fidelity through temporal feature interaction and improves video temporal consistency. Extensive experiments on face videos show that our method outperforms previous face restoration baselines. The code will be released on https://github.com/kepengxu/PGTFormer.
List of keywords
Computer Vision -> CV: Computational photography
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Biometrics, face, gesture and pose recognition
Computer Vision -> CV: Applications
1546
Unified Evidence Enhancement Inference Framework for Fake News Detection
Lianwei Wu, Linyong Wang, Yongqiang Zhao
6 min. talk | August 8th at 11:30 | Session: NLP: Applications
Current approaches for fake news detection are mainly devoted to extracting candidate evidence from comments (or external articles) and establishing interactive reasoning with the news itself to verify its falsehood. However, they still have several drawbacks: 1) the interaction objects are coarse-grained: the entire news item is driven to participate in the interaction, while the learning of potentially suspicious segments in the news is ignored; 2) the reasoning patterns are limited, making it difficult to explore the various possible correlations between news and candidate evidence. To this end, we propose the Unified Evidence Enhancement Inference framework (UEEI) to discover and infer high-quality evidence that reveals the false parts of news for detection. Specifically, UEEI first promotes interactive fusion between comments and news from the perspectives of semantics and emotion, thereby learning the potentially suspicious fragments in news. Then, the model constructs entity-level and relationship-level retrievals to screen sufficient candidate evidence from external sources. Finally, we measure the coherence between suspicious fragments and candidate evidence by multi-view reasoning, and further infer explainable evidence that reveals the false parts of news. Experiments on three public datasets confirm the effectiveness and interpretability of UEEI.
List of keywords
Natural Language Processing -> NLP: Applications
Game Theory and Economic Paradigms -> GTEP: Computational social choice
Multidisciplinary Topics and Applications -> MTA: Social sciences
1551
An NCDE-based Framework for Universal Representation Learning of Time Series
Zihan Liu, Bowen Du, Junchen Ye, Xianqing Wen, Leilei Sun
6 min. talk | August 7th at 11:30 | Session: ML: Representation learning
Exploiting self-supervised learning (SSL) to extract universal representations of time series can not only capture the natural properties of time series but also greatly benefit downstream tasks. Nevertheless, existing time series representation learning (TSRL) methods face challenges in attaining universality. Indeed, methods relying solely on one SSL strategy (either contrastive learning (CL) or generative) often fall short of capturing rich semantic information for various downstream tasks. Moreover, time series exhibit diverse distributions and inherent characteristics, particularly the common occurrence of missing values, posing a notable challenge for existing backbones in effectively handling such diverse data. To bridge these gaps, we propose CTRL, a framework for universal TSRL. For the first time, we employ a Neural Controlled Differential Equation (NCDE) as the backbone for TSRL, which captures continuous processes and exhibits robustness to missing data. Additionally, a dual-task SSL strategy, integrating both reconstruction and contrasting tasks, is proposed to enrich the semantic information of the learned representations. Furthermore, novel hard negative construction and false negative elimination mechanisms are proposed to improve sampling efficiency and reduce sampling bias in CL. Finally, extensive experiments demonstrate the superiority of CTRL in forecasting, classification, and imputation tasks, particularly its outstanding robustness to missing data.
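For intuition, a neural CDE evolves a hidden state z by dz = f_theta(z) dX along a control path X interpolated from the (possibly irregular) observations, which is what makes it naturally robust to missing values. Below is a minimal Euler-discretized sketch under our own shapes and names; CTRL builds on full NCDE machinery, not this toy.

```python
import torch
import torch.nn as nn

class TinyNCDE(nn.Module):
    """Euler-discretized neural CDE: z_{t+1} = z_t + f_theta(z_t) dX_t,
    where dX_t is the increment of the observed (interpolated) path."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Linear(input_dim, hidden_dim)
        # f_theta maps the hidden state to a (hidden_dim x input_dim) matrix.
        self.f = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim * input_dim))

    def forward(self, x):                    # x: (batch, time, input_dim)
        z = self.embed(x[:, 0])              # initial hidden state
        for t in range(1, x.size(1)):
            dX = (x[:, t] - x[:, t - 1]).unsqueeze(-1)          # control increment
            F_z = self.f(z).view(z.size(0), z.size(1), x.size(2))
            z = z + (F_z @ dX).squeeze(-1)   # Euler step: z += f(z) dX
        return z                             # final representation

rep = TinyNCDE(input_dim=3, hidden_dim=32)(torch.randn(8, 50, 3))
```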
List of keywords
Machine Learning -> ML: Representation learning
Machine Learning -> ML: Self-supervised Learning
Machine Learning -> ML: Time series and data streams
1564
CausVSR: Causality Inspired Visual Sentiment Recognition
Xinyue Zhang, Zhaoxia Wang, Hailing Wang, Jing Xiang, Chunwei Wu, Guitao Cao
6 min. talk | August 8th at 11:30 | Session: HAI: Cognitive modeling
Visual Sentiment Recognition (VSR) is an evolving field that aims to detect emotional tendencies within visual content. Despite its growing significance, detecting the emotions depicted in visual content such as images faces challenges, notably misleading or spurious correlations arising from contextual information. In response to these challenges, we propose a causality-inspired VSR approach, called CausVSR. CausVSR is rooted in the fundamental principles of Emotional Causality theory, mimicking the human process from receiving emotional stimuli to deriving emotional states. CausVSR takes a deliberate stride toward conquering the VSR challenges. It harnesses the power of a structural causal model, intricately designed to encapsulate the dynamic causal interplay between visual content and its corresponding pseudo sentiment regions. This strategic approach allows for a deep exploration of contextual information, elevating the accuracy of emotional inference. Additionally, CausVSR utilizes a global category elicitation module, strategically employed to execute front-door adjustment techniques, effectively detecting and handling spurious correlations. Experiments, conducted on four widely-used datasets, demonstrate CausVSR’s superiority in enhancing emotion perception within VSR, surpassing existing methods.
List of keywords
Humans and AI -> HAI: Cognitive modeling
Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Recognition (object detection, categorization)
Machine Learning -> ML: Deep learning architectures
1566
Modeling Selective Feature Attention for Lightweight Text Matching
Jianxiang Zang, Hui Liu
6 min. talk | August 7th at 15:00 | Session: NLP: Natural Language Processing (2/3)
Representation-based Siamese networks have risen to popularity in lightweight text matching due to their low deployment and inference costs. While word-level attention mechanisms have been implemented within Siamese networks to improve performance, we propose Feature Attention (FA), a novel downstream block designed to enrich the modeling of dependencies among embedding features. Employing "squeeze-and-excitation" techniques, the FA block dynamically adjusts the emphasis on individual features, enabling the network to concentrate more on features that significantly contribute to the final classification. Building upon FA, we introduce a dynamic "selection" mechanism called Selective Feature Attention (SFA), which leverages a stacked BiGRU Inception structure. The SFA block facilitates multi-scale semantic extraction by traversing different stacked BiGRU layers, encouraging the network to selectively concentrate on semantic information and embedding features across varying levels of abstraction. Both the FA and SFA blocks offer a seamless integration capability with various Siamese networks, showcasing a plug-and-play characteristic. Experimental evaluations conducted across diverse text matching baselines and benchmarks underscore the indispensability of modeling feature attention and the superiority of the "selection" mechanism.
List of keywords
Natural Language Processing -> NLP: Natural language semantics
Machine Learning -> ML: Attention models
Machine Learning -> ML: Deep learning architectures
Natural Language Processing -> NLP: Embeddings
1570
By Fair Means or Foul: Quantifying Collusion in a Market Simulation with Deep Reinforcement Learning
Michael Schlechtinger, Damaris Kosack, Franz Krause, Heiko Paulheim
6 min. talk | August 6th at 15:00 | Session: ETF: AI Ethics, Trust, Fairness (1/2)
In the rapidly evolving landscape of eCommerce, Artificial Intelligence (AI) based pricing algorithms, particularly those utilizing Reinforcement Learning (RL), are becoming increasingly prevalent. This rise has led to an intricate pricing situation with the potential for market collusion. Our research employs an experimental oligopoly model of repeated price competition, systematically varying the environment to cover scenarios from basic economic theory to subjective consumer demand preferences. We also introduce a novel demand framework that enables the implementation of various demand models, allowing for a weighted blending of different models. In contrast to existing research in this domain, we aim to investigate the strategies and emerging pricing patterns developed by the agents, which may lead to a collusive outcome. Furthermore, we investigate a scenario where agents cannot observe their competitors’ prices. Finally, we provide a comprehensive legal analysis across all scenarios. Our findings indicate that RL-based AI agents converge to a collusive state characterized by the charging of supracompetitive prices, without necessarily requiring inter-agent communication. Implementing alternative RL algorithms, altering the number of agents or simulation settings, and restricting the scope of the agents’ observation space do not significantly impact the collusive market outcome.
List of keywords
AI Ethics, Trust, Fairness -> ETF: AI and law, governance, regulation
Agent-based and Multi-agent Systems -> MAS: Agent-based simulation and emergence
Machine Learning -> ML: Reinforcement learning
Multidisciplinary Topics and Applications -> MTA: Economics
1578
Primal Grammars Driven Automated Induction
Adel Bouhoula, Miki Hermann
6 min. talk | August 7th at 15:00 | Session: KRR: Knowledge Representation and Reasoning (1/2)
Automated induction is a powerful method for the validation of critical systems. However, the inductive proof process faces major challenges: it is undecidable and diverges even with small examples. Previous methods have proposed ad-hoc heuristics to speculate on additional lemmas that hopefully stop the divergence. Although these methods have succeeded in proving interesting theorems, they have significant limitations: in particular, they often fail to find appropriate lemmas, and the lemmas they provide may not be valid. We present a new method that allows us to perform inductive proofs in conditional theories. This method automatically detects divergence in proof traces and derives primal grammars as well as new lemmas that schematize the divergent sequence. This new construction allows us to break the divergence and complete the proof. Our method is presented as a set of inference rules whose soundness and refutational completeness have been formally proved. Unlike previous methods, our method is fully automated and has no risk of over-generalization. Moreover, our technique for capturing and schematizing divergence represents the most general decidable schematization, with respect to description power, among all known schematizations. Our method has been implemented in C++ and successfully proved over fifty complex examples that fail with well-known theorem provers (e.g., ACL2, Isabelle, PVS, SPIKE) and related methods for handling divergence in proofs by induction. Our method represents a significant contribution to the field of automated reasoning as it can be integrated with existing automated and interactive inductive proof systems to enhance their performance. Moreover, it has the potential to substantially reduce the time needed for the verification of critical systems.
List of keywords
Knowledge Representation and Reasoning -> KRR: Automated reasoning and theorem proving
1593
DFMDA-Net: Dense Fusion and Multi-dimension Aggregation Network for Image Restoration
Huibin Yan, Shuoyao Wang
6 min. talk | August 9th at 11:30 | Session: CV: Machine learning for vision
The U-shape (encoder-decoder) architecture, combined with effective blocks, has shown significant success in image restoration. In U-shape models, there is insufficient focus on the problem of fusing encoder and decoder features at the same level. Current methods often employ simplistic operations like summation or concatenation, which makes it difficult to strike a balance between performance and complexity. To address this issue, we propose a compression-in-the-middle mechanism, termed Integration-Compression-Integration (ICI), which effectively conducts dense fusion and avoids information loss. From the block design perspective, we design a multi-dimension aggregation (MDA) mechanism, capable of effectively aggregating features along both the channel and spatial dimensions. Combining the Integration-Compression-Integration feature fusion and the multi-dimension aggregation, our dense fusion and multi-dimension aggregation network (DFMDA-Net) achieves superior performance over state-of-the-art algorithms on 16 benchmark datasets for numerous image restoration tasks.
List of keywords
Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Representation learning
Machine Learning -> ML: Attention models
Machine Learning -> ML: Convolutional networks
1601
Robust Losses for Decision-Focused Learning
Noah Schutte, Krzysztof Postek, Neil Yorke-Smith
12 min. talk | August 9th at 10:00 | Session: ML: Robustness
Optimization models used to make discrete decisions often contain uncertain parameters that are context-dependent and estimated through prediction. To account for the quality of the decision made based on the prediction, decision-focused learning (end-to-end predict-then-optimize) trains the predictive model to minimize regret, i.e., the loss incurred by making a suboptimal decision. Although the gradient of this loss with respect to the predictive model’s parameters is zero almost everywhere for optimization problems with a linear objective, effective gradient-based learning approaches have been proposed to minimize the expected loss, using the empirical loss as a surrogate. However, empirical regret can be an ineffective surrogate because empirically optimal decisions can vary substantially from expected optimal decisions. To understand the impact of this deficiency, we evaluate the effect of aleatoric and epistemic uncertainty on the accuracy of empirical regret as a surrogate. Next, we propose three novel loss functions that approximate expected regret more robustly. Experimental results show that training two state-of-the-art decision-focused learning approaches with the robust regret losses generally improves test-sample empirical regret while keeping computational time equivalent relative to the number of training epochs.
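Regret here is the gap between the objective value of the decision induced by the predicted parameters and that of the decision optimal under the true parameters. A toy sketch for a linear objective over an explicit finite decision set follows; the paper's problems are combinatorial, and all names here are ours.

```python
import numpy as np

def regret(c_pred, c_true, decisions):
    """Empirical regret of predict-then-optimize with objective min_x c^T x
    over a finite decision set (rows of `decisions`)."""
    x_pred = decisions[np.argmin(decisions @ c_pred)]   # act on the prediction
    x_star = decisions[np.argmin(decisions @ c_true)]   # oracle decision
    return (x_pred - x_star) @ c_true                   # extra true cost incurred

# Three feasible 0/1 decisions over 4 items; predicted vs. true costs.
decisions = np.array([[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]])
print(regret(np.array([1., 2., 3., 4.]),
             np.array([4., 3., 2., 1.]), decisions))    # 4.0
```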
List of keywords
Machine Learning -> ML: Robustness
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Machine Learning -> ML: Regression
Machine Learning -> ML: Optimization
1614
Protecting Object Detection Models from Model Extraction Attack via Feature Space Coverage
Zeyu Li, Yuwen Pu, Xuhong Zhang, Yu Li, Jinbao Li, Shouling Ji
6 min. talk | August 7th at 10:00 | Session: ETF: Safety and robustness
A model extraction attack aims at stealing a well-trained machine learning model’s functionality or private information. With the gradual popularization of AI-related technologies in daily life, various well-trained models are being deployed. As a result, these models are considered valuable assets and are attractive to model extraction attackers. Currently, the academic community focuses primarily on defenses against model extraction attacks in the context of classification, paying little attention to the more common task scenario of object detection. Therefore, in this paper we propose a detection framework targeting model extraction attacks against object detection models. The framework first locates suspicious users based on the feature coverage of their query traffic and uses an active verification module to confirm whether the identified suspicious users are attackers. Through experiments conducted in multiple task scenarios, we validate the effectiveness and detection efficiency of the proposed method.
List of keywords
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
Computer Vision -> CV: Recognition (object detection, categorization)
1625
Exploring Cross-Domain Few-Shot Classification via Frequency-Aware Prompting
Tiange Zhang, Qing Cai, Feng Gao, Lin Qi, Junyu Dong
6 min. talk | August 9th at 11:30 | Session: ML: Classification
Cross-Domain Few-Shot Learning has made great strides with the development of meta-learning. However, most existing methods pay more attention to learning domain-adaptive inductive bias (meta-knowledge) through feature-wise manipulation or task diversity improvement, while neglecting the phenomenon that deep networks tend to rely more on high-frequency cues to make classification decisions. This degrades the robustness of the learned inductive bias, since high-frequency information is vulnerable and easily disturbed by noise. Hence, in this paper we make one of the first attempts to propose a Frequency-Aware Prompting method with mutual attention for Cross-Domain Few-Shot classification, which lets networks simulate the human visual perception of selecting different frequency cues when facing new recognition tasks. Specifically, a frequency-aware prompting mechanism is first proposed, in which the high-frequency components of the decomposed source image are either switched with normal-distribution sampling or zeroed to obtain frequency-aware augmented samples. Then, a mutual attention module is designed to learn a generalizable inductive bias under CD-FSL settings. More importantly, the proposed method is a plug-and-play module that can be directly applied to most off-the-shelf CD-FSL methods. Experimental results on CD-FSL benchmarks demonstrate the effectiveness of our proposed method, which also robustly improves the performance of existing CD-FSL methods. Resources at https://github.com/tinkez/FAP_CDFSC.
List of keywords
Machine Learning -> ML: Classification
Machine Learning -> ML: Few-shot learning
Machine Learning -> ML: Meta-learning
Machine Learning -> ML: Multi-task and transfer learning
1627
Bridge to Non-Barrier Communication: Gloss-Prompted Fine-Grained Cued Speech Gesture Generation with Diffusion Model
Wentao Lei, Li Liu, Jun Wang
6 min. talk | August 8th at 10:00 | Session: NLP: Speech
Cued Speech (CS) is an advanced visual phonetic encoding system that integrates lip reading with hand codings, enabling people with hearing impairments to communicate efficiently. CS video generation aims to produce the specific lip and gesture movements of CS from audio or text inputs. The main challenge is that, given limited CS data, we strive to simultaneously generate fine-grained hand and finger movements as well as lip movements, while the two kinds of movements need to be asynchronously aligned. Existing CS generation methods are fragile and prone to poor performance due to template-based statistical models and careful hand-crafted pre-processing to fit the models. Therefore, we propose a novel Gloss-prompted Diffusion-based CS Gesture generation framework (called GlossDiff). Specifically, to integrate additional knowledge of linguistic rules into the model, we first introduce a bridging instruction called Gloss, an automatically generated descriptive text that establishes a direct and more delicate semantic connection between spoken language and CS gestures. Moreover, we are the first to suggest that rhythm is an important paralinguistic feature of CS that improves communication efficacy, and we propose a novel Audio-driven Rhythmic Module (ARM) to learn rhythm that matches the audio speech. In addition, in this work we design, record, and publish the first Chinese CS dataset with four CS cuers. Extensive experiments demonstrate that our method quantitatively and qualitatively outperforms current state-of-the-art (SOTA) methods. We will release the code and data at glossdiff.github.io/.
List of keywords
Natural Language Processing -> NLP: Speech
Computer Vision -> CV: Applications
1631
Provable Acceleration of Nesterov’s Accelerated Gradient Method over Heavy Ball Method in Training Over-Parameterized Neural Networks
Xin Liu, Wei Tao, Wei Li, Dazhi Zhan, Jun Wang, Zhisong Pan
6 min. talk | August 6th at 15:00 | Session: ML: Machine Learning (1/6)
Due to its simplicity and efficiency, the first-order gradient method has been extensively employed in training neural networks. Although the optimization problem of the neural network is non-convex, recent research has proved that first-order methods are capable of attaining a global minimum when training over-parameterized neural networks, where the number of parameters is significantly larger than the number of training instances. Momentum methods, including the heavy ball (HB) method and Nesterov’s accelerated gradient (NAG) method, are the workhorses of first-order gradient methods owing to their accelerated convergence. In practice, NAG often exhibits performance superior to HB. However, current theoretical works fail to distinguish their convergence difference in training neural networks. To fill this gap, we consider the training problem of a two-layer ReLU neural network under over-parameterization and random initialization. Leveraging high-resolution dynamical systems and neural tangent kernel (NTK) theory, our result not only establishes tighter upper bounds on the convergence rate for both HB and NAG, but also provides the first theoretical guarantee for the acceleration of NAG over HB in training neural networks. Finally, we validate our theoretical results on three benchmark datasets.
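The two momentum methods differ only in where the gradient is evaluated: HB uses the current iterate, while NAG uses the look-ahead point. A small sketch on a toy quadratic follows, using the standard textbook update forms rather than the paper's high-resolution dynamical-system analysis.

```python
import numpy as np

def heavy_ball(grad, w, w_prev, lr=0.01, beta=0.9):
    # Polyak's heavy ball: gradient taken at the current point.
    return w - lr * grad(w) + beta * (w - w_prev)

def nesterov(grad, w, w_prev, lr=0.01, beta=0.9):
    # NAG: gradient taken at the momentum look-ahead point.
    lookahead = w + beta * (w - w_prev)
    return lookahead - lr * grad(lookahead)

# Toy quadratic f(w) = 0.5 * w^T A w to compare the two trajectories.
A = np.diag([1.0, 100.0])
grad = lambda w: A @ w
for step in (heavy_ball, nesterov):
    w, w_prev = np.array([1.0, 1.0]), np.array([1.0, 1.0])
    for _ in range(100):
        w, w_prev = step(grad, w, w_prev), w
    print(step.__name__, np.linalg.norm(w))
```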
List of keywords
Machine Learning -> ML: Theory of deep learning
Machine Learning -> ML: Optimization
1635
Spatial-Temporal-Decoupled Masked Pre-training for Spatiotemporal Forecasting
Haotian Gao, Renhe Jiang, Zheng Dong, Jinliang Deng, Yuxin Ma, Xuan Song
6 min. talk | August 8th at 15:00 | Session: ML: Time series and data streams
Spatiotemporal forecasting techniques are significant for various domains such as transportation, energy, and weather. Accurate prediction of spatiotemporal series remains challenging due to the complex spatiotemporal heterogeneity. In particular, current end-to-end models are limited by input length and thus often fall into spatiotemporal mirage, i.e., similar input time series followed by dissimilar future values and vice versa. To address these problems, we propose a novel self-supervised pre-training framework Spatial-Temporal-Decoupled Masked Pre-training (STD-MAE) that employs two decoupled masked autoencoders to reconstruct spatiotemporal series along the spatial and temporal dimensions. Rich-context representations learned through such reconstruction could be seamlessly integrated by downstream predictors with arbitrary architectures to augment their performances. A series of quantitative and qualitative evaluations on four widely used benchmarks (PEMS03, PEMS04, PEMS07, and PEMS08) are conducted to validate the state-of-the-art performance of STD-MAE. Codes are available at https://github.com/Jimmy-7664/STD-MAE.
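The decoupling can be pictured as two masking patterns over the same spatiotemporal tensor: one hides whole time steps, the other whole sensors, and each feeds its own masked autoencoder. A shape-level sketch under hypothetical shapes, not the released implementation:

```python
import torch

def decoupled_masks(batch, mask_ratio=0.25):
    """Build temporally- and spatially-decoupled masks for a tensor of
    shape (batch, time, nodes, feat), one per reconstruction target."""
    B, T, N, _ = batch.shape
    t_mask = torch.rand(B, T, 1, 1) < mask_ratio   # hide whole time steps
    s_mask = torch.rand(B, 1, N, 1) < mask_ratio   # hide whole sensors/nodes
    return batch.masked_fill(t_mask, 0.0), batch.masked_fill(s_mask, 0.0)

x = torch.randn(4, 12, 207, 3)          # e.g. 12 steps over 207 sensors
x_temporal, x_spatial = decoupled_masks(x)
```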
List of keywords
Machine Learning -> ML: Time series and data streams
Data Mining -> DM: Mining spatial and/or temporal data
Knowledge Representation and Reasoning -> KRR: Qualitative, geometric, spatial, and temporal reasoning
1645
Let’s Start Over: Retraining with Selective Samples for Generalized Category Discovery
Zhimao Peng, Enguang Wang, Xialei Liu, Ming-Ming Cheng
6 min. talk | August 8th at 10:00 | Session: ML: Clustering
Generalized Category Discovery (GCD) presents a realistic and challenging problem in open-world learning. Given a partially labeled dataset, GCD aims to categorize unlabeled data by leveraging visual knowledge from the labeled data, where the unlabeled data includes both known and unknown classes. Existing methods based on parametric/non-parametric classifiers attempt to generate pseudo-labels/relationships for the unlabeled data to enhance representation learning. However, the lack of ground-truth labels for novel classes often leads to noisy pseudo-labels/relationships, resulting in suboptimal representation learning. This paper introduces a novel method using Nearest Neighbor Distance-aware Label Consistency sample selection. It creates class-consistent subsets for novel class sample clusters from the current GCD method, acting as “pseudo-labeled sets” to mitigate representation bias. We propose progressive supervised representation learning with selected samples to optimize the trade-off between quantity and purity in each subset. Our method is versatile and applicable to various GCD methods, whether parametric or non-parametric. We conducted extensive experiments on multiple generic and fine-grained image classification datasets to evaluate the effectiveness of our approach. The results demonstrate the superiority of our method in achieving improved performance in generalized category discovery tasks.
List of keywords
Machine Learning -> ML: Clustering
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
Machine Learning -> ML: Classification
Computer Vision -> CV: Recognition (object detection, categorization)
1653
Convexity Certificates for Symbolic Tensor Expressions
Paul G. Rump, Niklas Merk, Julien Klaus, Maurice Wenig, Joachim Giesen
6 min. talk | August 8th at 10:00 | Session: CSO: Constraint Satisfaction and Optimization
Knowing that a function is convex ensures that any local minimum is also a global minimum. Here, we implement an approach to certify the convexity of twice-differentiable functions by certifying that their second-order derivative is positive semidefinite. Both the computation of the second-order derivative and the certification of positive semidefiniteness are done symbolically. Previous implementations of this approach assume that the function to be minimized takes scalar or vector inputs, meaning that the second-order derivative is at most a matrix. However, the input of many machine learning problems is naturally given in the form of matrices or higher order tensors, in which case the second-order derivative becomes a tensor of at least fourth order. The familiar linear algebra notations and known rules for determining whether a matrix is positive semidefinite are not sufficient to deal with these higher order expressions. Here, we present a formal language for tensor expressions that allows us to generalize semidefiniteness to higher-order tensors and thereby certify the convexity of a broader set of functions.
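In the familiar scalar/vector case, the approach amounts to symbolically computing the Hessian and certifying that it is positive semidefinite; the paper's contribution is generalizing this check to tensor-valued inputs, where the second-order derivative is a fourth-order tensor. A minimal SymPy sketch of the vector case (example function chosen by us):

```python
import sympy as sp

# Certify convexity of f(x, y) = log(exp(x) + exp(y)) by checking that
# its symbolic Hessian is positive semidefinite.
x, y = sp.symbols("x y", real=True)
f = sp.log(sp.exp(x) + sp.exp(y))
H = sp.hessian(f, (x, y))

# For a symmetric 2x2 matrix, PSD holds iff trace >= 0 and det >= 0.
print(sp.simplify(H.trace()))   # strictly positive for all x, y
print(sp.simplify(H.det()))     # simplifies to 0, so H is PSD: f is convex
```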
List of keywords
Constraint Satisfaction and Optimization -> CSO: Solvers and tools
Machine Learning -> ML: Optimization
Machine Learning -> ML: Symbolic methods
1688
Attribution Quality Metrics with Magnitude Alignment
Chase Walker, Dominic Simon, Kenny Chen, Rickard Ewetz
6 min. talk | August 6th at 11:30 | Session: ETF: Explainability and interpretability
Attribution algorithms play an instrumental role in human interpretation of AI models. The methods measure the importance of the input features to the model output decision, which can be displayed as an attribution map for image classifiers. Perturbation tests are the state-of-the-art approach to evaluate the quality of an attribution map. Unfortunately, we observe that perturbation tests fail to consider attribution magnitude, which translates into inconsistent quality scores. In this paper, we propose Magnitude Aligned Scoring (MAS), a new attribution quality metric that measures the alignment between the magnitude of the attributions and the model response. In particular, the metric accounts for both the relative ordering and the magnitude of the pixels within an attribution. In the experimental evaluation, we compare the MAS metric with existing metrics across a wide range of models, datasets, attributions, and evaluations. The results demonstrate that the MAS metric is 4x more sensitive to attribution changes, 2x more consistent, and 1.6x more invariant to baseline modifications. Our code and the referenced appendix are publicly available via https://github.com/chasewalker26/Magnitude-Aligned-Scoring.
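The deletion-style perturbation test that the paper critiques can be sketched as follows; note that only the attribution ranking enters, never the magnitude, which is precisely the gap MAS addresses. The model callable and step count are illustrative placeholders, and this is the baseline test, not the MAS formula itself.

import numpy as np

def deletion_curve(model, image, attribution, steps=10):
    # Remove pixels in order of attribution rank, tracking the model response.
    order = np.argsort(attribution.ravel())[::-1]  # rank only; magnitude ignored
    x = image.copy().ravel()
    chunk = max(1, len(order) // steps)
    scores = []
    for s in range(steps):
        x[order[s * chunk:(s + 1) * chunk]] = 0.0
        scores.append(model(x.reshape(image.shape)))
    return np.array(scores)  # a fast drop counts as a "good" map under this test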
List of keywords
AI Ethics, Trust, Fairness -> ETF: Explainability and interpretability
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
Computer Vision -> CV: Interpretability and transparency
Machine Learning -> ML: Explainable/Interpretable machine learning
1700
MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement
Zifeng Wang, Chufan Gao, Cao Xiao, Jimeng Sun
6 min. talk | August 9th at 11:30 | Session: MTA: Health and medicine
Tabular data prediction has been employed in medical applications such as patient health risk prediction. However, existing methods usually revolve around the algorithm design while overlooking the significance of data engineering. Medical tabular datasets frequently exhibit significant heterogeneity across different sources, with limited sample sizes per source. As such, previous predictors are often trained on manually curated small datasets that struggle to generalize across different tabular datasets during inference. This paper proposes to scale medical tabular data predictors (MediTab) to various tabular inputs with varying features. The method uses a data engine that leverages large language models (LLMs) to consolidate tabular samples, overcoming the barrier across tables with distinct schemas. It also aligns out-domain data with the target task using a “learn, annotate, and refinement” pipeline. The expanded training data then enables the pre-trained MediTab to make inferences on arbitrary tabular inputs in the domain without fine-tuning, resulting in significant improvements over supervised baselines: it reaches an average ranking of 1.57 and 1.00 on 7 patient outcome prediction datasets and 3 trial outcome prediction datasets, respectively. In addition, MediTab exhibits impressive zero-shot performance: it outperforms supervised XGBoost models by 8.9% and 17.2% on average in two prediction tasks, respectively.
List of keywords
Multidisciplinary Topics and Applications -> MTA: Health and medicine
Multidisciplinary Topics and Applications -> MTA: Bioinformatics
Multidisciplinary Topics and Applications -> MTA: Life sciences
1703
Towards Robust Multi-Label Learning against Dirty Label Noise
Yuhai Zhao, Yejiang Wang, Zhengkui Wang, Wen Shan, Miaomiao Huang, Meixia Wang, Min Huang, Xingwei Wang
6 min. talk | August 9th at 11:30 | Session: ML: Multi-label learning
In multi-label learning, one of the major challenges is that the data are associated with label noise, including random noisy labels (e.g., data encoding errors) and noisy labels created by annotators (e.g., missing, extra, or erroneous labels), where the noise follows different structures (e.g., Gaussian, sparse, or subjective). Existing methods are tailored to handle noise with one specific structure. However, they do not consider the fact that, in real applications, data often carry dirty noisy labels that are simultaneously Gaussian, sparse, and subjective. In this paper, we formalize multi-label learning with dirty noise as a new learning problem, namely Noisy Multi-label Learning (NML). To solve the NML problem, we decompose a corrupted label matrix as a noise matrix plus a true label matrix (possibly high-rank). For the noise matrix, a mixed-norm penalty is developed as a regularizer for the dirty noise distribution. Under this norm, the conditions required for exact noise recovery are provided theoretically. For the true label matrix, which is not necessarily low-rank, we apply a non-linear mapping to ensure its low-rankness so that high-order label correlations can be utilized. Experimental results show that the proposed method significantly outperforms state-of-the-art methods.
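In plain notation, the decomposition reads Y = L + E. One plausible shape for the objective, with the exact rank surrogate and mixed norm being the paper's contribution rather than what is written here, is

min over L, E of rank_surrogate(φ(L)) + λ1·‖E‖_F^2 + λ2·‖E‖_1 + λ3·‖E‖_{2,1}, subject to Y = L + E,

where the Frobenius term absorbs Gaussian noise, the ℓ1 term sparse noise, the ℓ2,1 term row-structured (subjective, annotator-driven) noise, and φ is the non-linear mapping that makes φ(L) low-rank.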
List of keywords
Machine Learning -> ML: Multi-label learning
Machine Learning -> ML: Optimization
Machine Learning -> ML: Weakly supervised learning
1708
EMOTE: An Explainable Architecture for Modelling the Other through Empathy
Manisha Senadeera, Thommen Karimpanal George, Stephan Jacobs, Sunil Gupta, Santu Rana
12 min. talk | August 6th at 11:30 | Session: ML: Multiagent Reinforcement Learning
Empathy allows us to assume others are like us and have goals analogous to our own. This can at times also apply to multi-agent games – e.g., Agent 1’s attraction to green balls is analogous to Agent 2’s attraction to red balls. Drawing inspiration from empathy, we propose EMOTE, a simple and explainable inverse reinforcement learning (IRL) approach designed to model another agent’s action-value function and, from it, infer a unique reward function. This is done by referencing the learning agent’s own action-value function, removing the need to maintain independent action-value estimates for the modelled agents while simultaneously addressing the ill-posed nature of IRL by inferring a unique reward function. We experiment on minigrid environments, showing that EMOTE: (a) produces more consistent reward estimates than other IRL baselines; (b) is robust in scenarios with composite reward and action-value functions; and (c) produces human-interpretable states, helping to explain how the agent views other agents.
List of keywords
Machine Learning -> ML: Multiagent Reinforcement Learning
Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
AI Ethics, Trust, Fairness -> ETF: Explainability and interpretability
Machine Learning -> ML: Reinforcement learning
1716
Towards Dynamic Trend Filtering through Trend Point Detection with Reinforcement Learning
Jihyeon Seong, Sekwang Oh, Jaesik Choi
6 min. talk | August 8th at 10:00 | Session: DM: Mining spatial and/or temporal data (2/2)
Trend filtering simplifies complex time series data by applying smoothness to filter out noise while emphasizing proximity to the original data. However, existing trend filtering methods fail to reflect abrupt changes in the trend due to “approximateness,” resulting in constant smoothness. This approximateness uniformly filters out the tail distribution of time series data, characterized by extreme values, including both abrupt changes and noise. In this paper, we propose Trend Point Detection formulated as a Markov Decision Process (MDP), a novel approach to identifying essential points that should be reflected in the trend, departing from approximations. We term these essential points Dynamic Trend Points (DTPs) and extract trends by interpolating them. To identify DTPs, we utilize Reinforcement Learning (RL) within a discrete action space and a forecasting sum-of-squares loss function as a reward, referred to as the Dynamic Trend Filtering network (DTF-net). DTF-net integrates flexible noise filtering, preserving critical original subsequences while removing noise as required for other subsequences. We demonstrate that DTF-net excels at capturing abrupt changes compared to other trend filtering algorithms and enhances forecasting performance, as abrupt changes are predicted rather than smoothed out.
List of keywords
Data Mining -> DM: Mining spatial and/or temporal data
Machine Learning -> ML: Reinforcement learning
1752
3DBench: A Scalable 3D Benchmark and Instruction-Tuning Dataset
Junjie Zhang, Tianci Hu, Xiaoshui Huang, Yongshun Gong, Dan Zeng
6 min. talk | August 6th at 11:30 | Session: CV: 3D computer vision (1/2)
Evaluating the performance of Multi-modal Large Language Models (MLLMs), integrating both point cloud and language, presents significant challenges. The lack of a comprehensive assessment hampers determining whether these models truly represent advancements, thereby impeding further progress in the field. Current evaluations heavily rely on classification and caption tasks, falling short in providing a thorough assessment of MLLMs. A pressing need exists for a more sophisticated evaluation method capable of thoroughly analyzing the spatial understanding and expressive capabilities of these models. To address these issues, we introduce a scalable 3D benchmark, accompanied by a large-scale instruction-tuning dataset known as 3DBench, providing an extensible platform for a comprehensive evaluation of MLLMs. Specifically, we establish the benchmark that spans a wide range of spatial and semantic scales, from object-level to scene-level, addressing both perception and planning tasks. Furthermore, we present a rigorous pipeline for automatically constructing scalable 3D instruction-tuning datasets, covering 10 diverse multi-modal tasks with more than 0.23 million QA pairs generated in total. Thorough experiments evaluating trending MLLMs, comparisons against existing datasets, and variations of training protocols demonstrate the superiority of 3DBench, offering valuable insights into current limitations and potential research directions. Codes are available at https://github.com/Inshsang/3DBench.
List of keywords
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Scene analysis and understanding   
1758
Self-adaptive Extreme Penalized Loss for Imbalanced Time Series Prediction
Yiyang Wang, Yuchen Han, Yuhan Guo
6 min. talk | August 8th at 15:00 | Session: ML: Time series and data streams
Forecasting time series in imbalanced data presents a significant research challenge that requires considerable attention. Although there are specialized techniques available to tackle imbalanced time series prediction, existing approaches tend to prioritize extreme predictions at the expense of the forecasting accuracy of normal samples. In this paper, we propose an extreme penalized loss function that relaxes the constraint on overestimating extreme events, while imposing large penalties on errors for normal events and on underestimated extreme events. In addition, we provide a self-adaptive way of setting the hyperparameters of the loss function. Both the proposed loss function and an attention module are then integrated with LSTM networks in a decomposition-based framework. Extensive experiments conducted on real-world datasets demonstrate the superiority of our framework compared to other state-of-the-art approaches for both time series prediction and block maxima prediction tasks.
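A minimal sketch of such an asymmetric loss follows, assuming a fixed extreme threshold tau and a fixed weight gamma; the paper's version sets these self-adaptively, so both hyperparameters here are illustrative stand-ins.

import numpy as np

def extreme_penalized_loss(y_true, y_pred, tau, gamma=3.0):
    # Relaxed penalty only when an extreme event is overestimated; everything
    # else (normal errors, underestimated extremes) is penalized gamma-fold.
    err = y_pred - y_true
    over_extreme = (y_true > tau) & (err > 0)
    weight = np.where(over_extreme, 1.0, gamma)
    return np.mean(weight * err ** 2)

y = np.array([0.2, 0.1, 5.0]); yhat = np.array([0.3, 0.1, 3.0])
print(extreme_penalized_loss(y, yhat, tau=2.0))  # the underestimated extreme dominates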
List of keywords
Machine Learning -> ML: Time series and data streams
1762
Enhancing Boundary Segmentation for Topological Accuracy with Skeleton-based Methods
Chuni Liu, Boyuan Ma, Xiaojuan Ban, Yujie Xie, Hao Wang, Weihua Xue, Jingchao Ma, Ke Xu
12 min. talk | August 8th at 10:00 | Session: CV: Segmentation
Topological consistency plays a crucial role in the task of boundary segmentation for reticular images, such as cell membrane segmentation in neuron electron microscopic images, grain boundary segmentation in material microscopic images and road segmentation in aerial images. In these fields, topological changes in segmentation results have a serious impact on the downstream tasks, which can even exceed the misalignment of the boundary itself. To enhance the topology accuracy in segmentation results, we propose the Skea-Topo Aware loss, a novel loss function that takes into account the shape of each object and the topological significance of the pixels. It consists of two components. First, a skeleton-aware weighted loss improves the segmentation accuracy by better modeling the object geometry with skeletons. Second, a boundary rectified term effectively identifies and emphasizes topologically critical pixels in the prediction errors, using both foreground and background skeletons in the ground truth and predictions. Experiments show that our method improves topological consistency by up to 7 points in VI compared to 13 state-of-the-art methods, based on objective and subjective assessments across three different boundary segmentation datasets. The code is available at https://github.com/clovermini/Skea_topo.
List of keywords
Computer Vision -> CV: Segmentation
Computer Vision -> CV: Biomedical image analysis
1768
Unified Single-Stage Transformer Network for Efficient RGB-T Tracking
Jianqiang Xia, Dianxi Shi, Ke Song, Linna Song, Xiaolei Wang, Songchang Jin, Chenran Zhao, Yu Cheng, Lei Jin, Zheng Zhu, Jianan Li, Gang Wang, Junliang Xing, Jian Zhao
6 min. talk | August 8th at 10:00 | Session: CV: Motion and tracking
Most existing RGB-T tracking networks extract modality features in a separate manner, which lacks interaction and mutual guidance between modalities. This limits the network’s ability to adapt to the diverse dual-modality appearances of targets and the dynamic relationships between the modalities. Additionally, the three-stage fusion tracking paradigm followed by these networks significantly restricts the tracking speed. To overcome these problems, we propose a unified single-stage Transformer RGB-T tracking network, namely USTrack, which unifies the above three stages into a single ViT (Vision Transformer) backbone through joint feature extraction, fusion and relation modeling. With this structure, the network can not only extract the fusion features of templates and search regions under the interaction of modalities, but also significantly improve tracking speed through the single-stage fusion tracking paradigm. Furthermore, we introduce a novel feature selection mechanism based on modality reliability to mitigate the influence of invalid modalities for final prediction. Extensive experiments on three mainstream RGB-T tracking benchmarks show that our method achieves the new state-of-the-art while achieving the fastest tracking speed of 84.2 FPS. Code is available at https://github.com/xiajianqiang/USTrack.
List of keywords
Computer Vision -> CV: Motion and tracking
Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Video analysis and understanding   
1790
WPML3CP: Wasserstein Partial Multi-Label Learning with Dual Label Correlation Perspectives
Ximing Li, Yuanchao Dai, Bing Wang, Changchun Li, Renchu Guan, Fangming Gu, Jihong Ouyang
6 min. talk | August 8th at 10:00 | Session: ML: Weakly supervised learning
Partial multi-label learning (PMLL) refers to a weakly-supervised classification problem, where each instance is associated with a set of candidate labels that covers its ground-truth labels but also includes irrelevant ones. The current methodology of PMLL is to estimate the ground-truth confidences of candidate labels, i.e., the likelihood of a candidate label being a ground-truth one, and induce the multi-label predictor with them rather than with the candidate labels. In this paper, we aim to estimate precise ground-truth confidences by leveraging precise label correlations, which also need to be estimated. To this end, we propose to capture label correlations from both the measuring and modeling perspectives. Specifically, we measure the loss between ground-truth confidences and predictions by employing the Wasserstein distance involving label correlations, and form a label correlation-aware regularization to constrain the predictive parameters. The two techniques are coupled to promote precise estimation of label correlations. Upon these ideas, we propose a novel PMLL method, namely Wasserstein Partial Multi-Label Learning with dual Label Correlation Perspectives (WPML3CP). We conduct extensive experiments on several benchmark datasets. Empirical results demonstrate that WPML3CP can outperform existing PMLL baselines.
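The measuring side uses the standard discrete Wasserstein (optimal transport) form: given a ground-cost matrix C built from label correlations (C_jk small when labels j and k are strongly correlated), the loss between a confidence vector p and a prediction vector p̂ is

W_C(p, p̂) = min over transport plans T ≥ 0 with T·1 = p and Tᵀ·1 = p̂ of ⟨T, C⟩,

so that probability mass moves cheaply between correlated labels. This is the textbook definition the abstract refers to; constructing C and coupling it with the correlation-aware regularizer is the paper's contribution.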
List of keywords
Machine Learning -> ML: Weakly supervised learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Self-supervised Learning
1799
FedTAD: Topology-aware Data-free Knowledge Distillation for Subgraph Federated Learning
Yinlin Zhu, Xunkai Li, Zhengyu Wu, Di Wu, Miao Hu, Rong-Hua Li
6 min. talk | August 8th at 15:00 | Session: ML: Sequence and graph learning
Subgraph federated learning (subgraph-FL) is a new distributed paradigm that facilitates the collaborative training of graph neural networks (GNNs) on multi-client subgraphs. Unfortunately, a significant challenge of subgraph-FL arises from subgraph heterogeneity, which stems from node and topology variation and impairs the performance of the global GNN. Despite various studies, the impact mechanism of subgraph heterogeneity has not yet been thoroughly investigated. To this end, we decouple node and topology variation, revealing that they correspond to differences in label distribution and structure homophily. Remarkably, these variations lead to significant differences in the class-wise knowledge reliability of multiple local GNNs, misguiding the model aggregation to varying degrees. Building on this insight, we propose topology-aware data-free knowledge distillation technology (FedTAD), enhancing reliable knowledge transfer from the local models to the global model. Extensive experiments on six public datasets consistently demonstrate the superiority of FedTAD over state-of-the-art baselines.
List of keywords
Machine Learning -> ML: Sequence and graph learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Federated learning
1803
Enhancing Length Generalization for Attention Based Knowledge Tracing Models with Linear Biases
Xueyi Li, Youheng Bai, Teng Guo, Zitao Liu, Yaying Huang, Xiangyu Zhao, Feng Xia, Weiqi Luo, Jian Weng
6 min. talk | August 6th at 11:30 | Session: MTA: Multidisciplinary Topics and Applications (1/2)
Knowledge tracing (KT) is the task of predicting students’ future performance based on their historical learning interaction data. With the rapid advancement of attention mechanisms, many attention-based KT models have been developed. However, existing attention-based KT models exhibit performance drops as the number of student interactions grows beyond the number of interactions on which the models were trained. We refer to this as the length generalization of KT models. In this paper, we propose stableKT, which enhances length generalization: it learns from short sequences and maintains high prediction performance when generalizing to long sequences. Furthermore, we design a multi-head aggregation module to capture the complex relationships between questions and the corresponding knowledge components (KCs) by combining dot-product attention and hyperbolic attention. Experimental results on three public educational datasets show that our model exhibits a robust capability for length generalization and outperforms all baseline models in terms of AUC. To encourage reproducible research, we make our data and code publicly available at https://pykt.org.
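The linear biases of the title are in the spirit of ALiBi: a distance-proportional penalty is subtracted from the attention logits, so a model trained on short interaction sequences degrades gracefully on longer ones. A sketch follows; the slope, shapes, and masking are illustrative assumptions, not stableKT's exact design.

import numpy as np

def attention_with_linear_bias(q, k, v, slope=0.5):
    # q, k, v: (seq_len, dim). Logits are penalized linearly in |i - j|.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    scores = scores - slope * dist                     # the linear bias
    scores = np.where(np.tri(n) == 1.0, scores, -1e9)  # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v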
List of keywords
Multidisciplinary Topics and Applications -> MTA: Education
Humans and AI -> HAI: Computer-aided education
1810
InstructME: An Instruction Guided Music Edit Framework with Latent Diffusion Models
Bing Han, Junyu Dai, Weituo Hao, Xinyan He, Dong Guo, Jitong Chen, Yuxuan Wang, Yanmin Qian, Xuchen Song
6 min. talk | August 9th at 10:00 | Session: MTA: Multidisciplinary Topics and Applications (2/2)
Music editing primarily entails modifying instrument tracks or remixing the piece as a whole, which offers a novel reinterpretation of the original work through a series of operations. These music processing methods hold immense potential across various applications but demand substantial expertise. Prior methodologies, although effective for image and audio modifications, falter when directly applied to music. This is attributed to music’s distinctive data nature, where such methods can inadvertently compromise the intrinsic harmony and coherence of the music. In this paper, we develop InstructME, an Instruction guided Music Editing and remixing framework based on latent diffusion models. Our framework fortifies the U-Net with multi-scale aggregation in order to maintain consistency before and after editing. In addition, we introduce a chord progression matrix as conditioning information and incorporate it in the semantic space to improve melodic harmony while editing. To accommodate extended musical pieces, InstructME employs a chunk transformer, enabling it to discern long-term temporal dependencies within music sequences. We tested InstructME in instrument editing, remixing, and multi-round editing. Both subjective and objective evaluations indicate that our proposed method significantly surpasses preceding systems in music quality, text relevance and harmony. Demo samples are available at https://musicedit.github.io
List of keywords
Multidisciplinary Topics and Applications -> MTA: Arts and creativity
Multidisciplinary Topics and Applications -> MTA: Other
1819
Improving Adversarial Robustness via Feature Pattern Consistency Constraint
Jiacong Hu, Jingwen Ye, Zunlei Feng, Jiazhen Yang, Shunyu Liu, Xiaotian Yu, Lingxiang Jia, Mingli Song
6 min. talk | August 7th at 11:30 | Session: CV: Adversarial learning, adversarial attack and defense methods
Convolutional Neural Networks (CNNs) are well-known for their vulnerability to adversarial attacks, posing significant security concerns. In response to these threats, various defense methods have emerged to bolster the model’s robustness. However, most existing methods either focus on learning from adversarial perturbations, leading to overfitting to the adversarial examples, or aim to eliminate such perturbations during inference, inevitably increasing computational burdens. Conversely, clean training, which strengthens the model’s robustness by relying solely on clean examples, can address the aforementioned issues. In this paper, we align with this methodological stream and enhance its generalizability to unknown adversarial examples. This enhancement is achieved by scrutinizing the behavior of latent features within the network. Recognizing that a correct prediction relies on the correctness of the latent feature’s pattern, we introduce a novel and effective Feature Pattern Consistency Constraint (FPCC) method to reinforce the latent feature’s capacity to maintain the correct feature pattern. Specifically, we propose Spatial-wise Feature Modification and Channel-wise Feature Selection to enhance latent features. Subsequently, we employ the Pattern Consistency Loss to constrain the similarity between the feature pattern of the latent features and the correct feature pattern. Our experiments demonstrate that the FPCC method empowers latent features to uphold correct feature patterns even in the face of adversarial examples, resulting in inherent adversarial robustness surpassing state-of-the-art models.
List of keywords
Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
Computer Vision -> CV: Recognition (object detection, categorization)
Machine Learning -> ML: Classification
1823
AMO-aware Aggregates in Answer Set Programming
Mario Alviano, Carmine Dodaro, Salvatore Fiorentino, Marco Maratea
6 min. talk | August 7th at 11:30 | Session: KRR: Logic programming
Aggregates such as sum and count are among the most frequently used linguistic extensions of Answer Set Programming (ASP). At-most-one (AMO) constraints are a specific form of aggregate that excludes the simultaneous truth of multiple elements in a set. This article unleashes a powerful propagation strategy for the case in which groups of elements in an aggregate are also involved in AMO constraints. In fact, the combined knowledge given by aggregates and AMO constraints significantly increases the effectiveness of search-space pruning, resulting in noticeable performance gains.
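The pruning gain is easy to see on a weighted sum aggregate: within an at-most-one group, at most one weight can contribute, so the attainable upper bound shrinks from the sum of the group's weights to its maximum. A toy Python illustration of this arithmetic (the weights are made up; the paper's propagator operates inside an ASP solver):

def sum_upper_bound(groups):
    # groups: lists of non-negative weights; each group is one AMO constraint.
    naive = sum(sum(g) for g in groups)      # bound that ignores AMO knowledge
    amo_aware = sum(max(g) for g in groups)  # at most one literal per group
    return naive, amo_aware

print(sum_upper_bound([[3, 5, 2], [4, 4]]))  # (18, 9): a much tighter bound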
List of keywords
Knowledge Representation and Reasoning -> KRR: Logic programming
Knowledge Representation and Reasoning -> KRR: Non-monotonic reasoning
1828
Dynamic Brightness Adaptation for Robust Multi-modal Image Fusion
Yiming Sun, Bing Cao, Pengfei Zhu, Qinghua Hu
6 min. talk | August 9th at 10:00 | Session: CV: Applications
Infrared and visible image fusion aims to integrate modality strengths to produce visually enhanced, informative images. Visible imaging in real-world scenarios is susceptible to dynamic environmental brightness fluctuations, leading to texture degradation. Existing fusion methods lack robustness against such brightness perturbations, significantly compromising the visual fidelity of the fused imagery. To address this challenge, we propose the Brightness Adaptive multimodal dynamic fusion framework (BA-Fusion), which achieves robust image fusion despite dynamic brightness fluctuations. Specifically, we introduce a Brightness Adaptive Gate (BAG) module, which is designed to dynamically select features from brightness-related channels for normalization, while preserving brightness-independent structural information within the source images. Furthermore, we propose a brightness consistency loss function to optimize the BAG module. The entire framework is tuned via alternating training strategies. Extensive experiments validate that our method surpasses state-of-the-art methods in preserving multi-modal image information and visual fidelity, while exhibiting remarkable robustness across varying brightness levels. Our code is available: https://github.com/SunYM2020/BA-Fusion.
List of keywords
Computer Vision -> CV: Applications
Computer Vision -> CV: Multimodal learning
1862
Concentration Tail-Bound Analysis of Coevolutionary and Bandit Learning Algorithms
Per Kristian Lehre, Shishen Lin
6 min. talk | August 8th at 10:00 | Session: S: Search
Runtime analysis, as a branch of the theory of AI, studies how the number of iterations algorithms take before finding a solution (their runtime) depends on the design of the algorithm and the problem structure. Drift analysis is a state-of-the-art tool for estimating the runtime of randomised algorithms, such as bandit and evolutionary algorithms. Drift refers roughly to the expected progress towards the optimum per iteration. This paper considers the problem of deriving concentration tail bounds on the runtime of algorithms. It provides a novel drift theorem that gives precise exponential tail bounds given positive, weak, zero and even negative drift. Previously, such exponential tail bounds were missing in the case of weak, zero, or negative drift. Our drift theorem can be used to prove a strong concentration of the runtime/regret of algorithms in AI. For example, we prove that the regret of the RWAB bandit algorithm is highly concentrated, while previous analyses only considered the expected regret. This means that the algorithm obtains the optimum within a given time frame with high probability, i.e. a form of algorithm reliability. Moreover, our theorem implies that the time needed by the co-evolutionary algorithm RLS-PD to obtain a Nash equilibrium in a bilinear max-min benchmark problem is highly concentrated. However, we also prove that the algorithm forgets the Nash equilibrium, and the time until this occurs is highly concentrated. This highlights a weakness in RLS-PD which should be addressed by future work.
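For context, the classical additive drift theorem bounds only the expectation: if a non-negative process X_t satisfies E[X_t − X_{t+1} | X_t > 0] ≥ δ > 0, then the hitting time T = min{t : X_t = 0} obeys E[T] ≤ X_0/δ. The contribution here is exponential bounds on Pr[T > t] itself, including the weak-, zero-, and negative-drift regimes where expectation bounds of this kind say little.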
List of keywords
Search -> S: Evolutionary computation
Search -> S: Heuristic search
Search -> S: Other
1872
FactCHD: Benchmarking Fact-Conflicting Hallucination Detection
Xiang Chen, Duanzheng Song, Honghao Gui, Chenxi Wang, Ningyu Zhang, Yong Jiang, Fei Huang, Chengfei Lyu, Dan Zhang, Huajun Chen
6 min. talk | August 9th at 10:00 | Session: NLP: Resources and evaluation
Despite their impressive generative capabilities, LLMs are hindered by fact-conflicting hallucinations in real-world applications. The accurate identification of hallucinations in texts generated by LLMs, especially in complex inferential scenarios, is a relatively unexplored area. To address this gap, we present FactCHD, a dedicated benchmark designed for the detection of fact-conflicting hallucinations from LLMs. FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation. A distinctive element of FactCHD is its integration of fact-based evidence chains, significantly enhancing the depth of evaluating the detectors’ explanations. Experiments on different LLMs expose the shortcomings of current approaches in detecting factual errors accurately. Furthermore, we introduce TRUTH-TRIANGULATOR which synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2, aiming to yield more credible detection through the amalgamation of predictive results and evidence.
List of keywords
Natural Language Processing -> NLP: Resources and evaluation
Natural Language Processing -> NLP: Applications
1883
Two-stage Semi-supervised Speaker Recognition with Gated Label Learning
Xingmei Wang, Jiaxiang Meng, Kong Aik Lee, Boquan Li, Jinghan Liu
6 min. talk | August 8th at 10:00 | Session: NLP: Speech
Speaker recognition technologies have been successfully applied in diverse domains, benefiting from the advances of deep learning. Nevertheless, current efforts are still hampered by the lack of labeled data. Such issues have been addressed in computer vision through semi-supervised learning (SSL), which assigns pseudo labels to unlabeled data so that it can play the role of labeled data. In our empirical evaluations, the state-of-the-art SSL methods show unsatisfactory performance in speaker recognition tasks, due to the imbalance between the quantity and quality of pseudo labels. Therefore, in this work, we propose a two-stage SSL framework that aims to address the data scarcity challenge. We first construct an initial contrastive learning network, where the encoder outputs the embedding representation of utterances. We then construct an iterative holistic semi-supervised learning network that involves a clustering strategy to assign pseudo labels and a gated label learning (GLL) strategy to further select reliable pseudo-labeled data. Systematic evaluations show that our proposed framework achieves performance in speaker recognition superior to the state-of-the-art methods, matching the performance of supervised learning.
List of keywords
Natural Language Processing -> NLP: Speech
Machine Learning -> ML: Semi-supervised learning
1886
A Fast Algorithm for MaxSAT above Half Number of Clauses
Junqiang Peng, Mingyu Xiao
6 min. talk | August 9th at 10:00 | Session: CSO: Satisfiability
We study the following parameterization of the MaxSAT problem: Given a CNF formula F with m clauses, decide whether at least m/2 + μ clauses in F could be satisfied, where μ is the excess of the number of satisfied clauses over the trivial lower bound m/2 and is taken as the parameter. This perspective is known as the "above guarantee" parameterization. Since its introduction by Mahajan and Raman [1999], the analysis of parameterization above guarantee has become a highly active and fruitful line of research. In this paper, we develop a new algorithm with runtime O*(2.1479^μ), significantly improving the previous best upper bound O*(5.4064^μ) for this important problem. Here, the O* notation omits polynomial factors.
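The decision problem itself is easy to state; a brute-force reference implementation (exponential in the number of variables n, unlike the paper's algorithm, whose exponential part depends only on μ) might look like:

from itertools import product

def above_half(clauses, n, mu):
    # clauses: tuples of non-zero ints; k means variable k true, -k means false.
    # Decide whether some assignment satisfies >= m/2 + mu clauses.
    m = len(clauses)
    for bits in product([False, True], repeat=n):
        sat = sum(any((lit > 0) == bits[abs(lit) - 1] for lit in c)
                  for c in clauses)
        if sat >= m / 2 + mu:
            return True
    return False

print(above_half([(1, 2), (-1,), (-2,), (1, -2)], n=2, mu=1))  # True: 3 >= 2 + 1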
List of keywords
Constraint Satisfaction and Optimization -> CSO: Satisfiability
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Constraint Satisfaction and Optimization -> CSO: Constraint satisfaction
Search -> S: Combinatorial search and optimisation
1888
InstructEdit: Instruction-Based Knowledge Editing for Large Language Models
Ningyu Zhang, Bozhong Tian, Siyuan Cheng, Xiaozhuan Liang, Yi Hu, Kouying Xue, Yanjie Gou, Xi Chen, Huajun Chen
6 min. talk | August 9th at 11:30 | Session: NLP: Natural Language Processing (3/3)
Knowledge editing for large language models can offer an efficient solution to alter a model’s behavior without negatively impacting overall performance. However, current approaches encounter issues with limited generalizability across tasks, necessitating one distinct editor for each task, which significantly hinders broader applications. To address this, we take the first step toward analyzing the multi-task generalization issue in knowledge editing. Specifically, we develop an instruction-based editing technique, termed InstructEdit, which facilitates the editor’s adaptation to various task performances simultaneously using simple instructions. With only one unified editor for each LLM, we empirically demonstrate that InstructEdit can improve the editor’s control, leading to an average 14.86% increase in Reliability in the multi-task editing setting. Furthermore, experiments involving held-out unseen tasks illustrate that InstructEdit consistently surpasses previous strong baselines. To further investigate the underlying mechanisms of instruction-based knowledge editing, we analyze the principal components of the editing gradient directions, which unveils that instructions can help control the optimization direction with stronger OOD generalization.
List of keywords
Natural Language Processing -> NLP: Language models
Natural Language Processing -> NLP: Applications
1919
Subgraph Pooling: Tackling Negative Transfer on Graphs
Zehong Wang, Zheyuan Zhang, Chuxu Zhang, Yanfang Ye
6 min. talk | August 8th at 15:00 | Session: ML: Sequence and graph learning
Transfer learning aims to enhance performance on a target task by using knowledge from related tasks. However, when the source and target tasks are not closely aligned, it can lead to reduced performance, known as negative transfer. Unlike in image or text data, we find that negative transfer commonly occurs in graph-structured data, even when source and target graphs have semantic similarities. Specifically, we identify that structural differences significantly amplify the dissimilarities in node embeddings across graphs. To mitigate this, this paper offers a new insight: for semantically similar graphs, although structural differences lead to a significant distribution shift in node embeddings, their impact on subgraph embeddings can be marginal. Building on this insight, we introduce Subgraph Pooling (SP), which aggregates nodes sampled from a k-hop neighborhood, and Subgraph Pooling++ (SP++), which uses a random walk, to mitigate the impact of graph structural differences on knowledge transfer. We theoretically analyze the role of SP in reducing graph discrepancy and conduct extensive experiments to evaluate its superiority under various settings. The proposed SP methods are effective yet elegant and can be easily applied on top of any backbone graph neural network (GNN). Our code and data are available at: https://github.com/Zehong-Wang/Subgraph-Pooling.
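The core operation is simple enough to sketch: replace each node's embedding with the mean over its k-hop neighborhood before comparing graphs. The toy adjacency and embeddings below are placeholders, and SP++ would swap the k-hop ball for random-walk sampling.

import numpy as np

def khop_pool(adj, h, k=2):
    # adj: (n, n) 0/1 adjacency; h: (n, d) node embeddings.
    n = len(adj)
    reach = np.eye(n, dtype=int)
    power = np.eye(n, dtype=int)
    for _ in range(k):
        power = power @ adj            # walks one hop longer
        reach = reach + power
    mask = reach > 0                   # nodes within k hops, incl. self
    return np.stack([h[m].mean(axis=0) for m in mask])

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
h = np.random.default_rng(0).normal(size=(3, 4))
print(khop_pool(adj, h, k=1).shape)    # (3, 4): one pooled vector per node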
List of keywords
Machine Learning -> ML: Sequence and graph learning
Data Mining -> DM: Mining graphs
Machine Learning -> ML: Multi-task and transfer learning
Machine Learning -> ML: Semi-supervised learning
1920
Game Transformations That Preserve Nash Equilibria or Best-Response Sets
Emanuel Tewolde, Vincent Conitzer
6 min. talk | August 8th at 15:00 | Session: GTEP: Game Theory and Economic Paradigms
In this paper, we investigate under which conditions normal-form games are (guaranteed to be) strategically equivalent. First, we show for N-player games (N >= 3) that (A) it is NP-hard to decide whether a given strategy is a best response to some strategy profile of the opponents, and that (B) it is co-NP-hard to decide whether two games have the same best-response sets. Combining that with known results from the literature, we move our attention to equivalence-preserving game transformations. It is a widely used fact that a positive affine (linear) transformation of the utility payoffs neither changes the best-response sets nor the Nash equilibrium set. We investigate which other game transformations also possess either of the following two properties when being applied to an arbitrary N-player game (N >= 2): (i) The Nash equilibrium set stays the same; (ii) The best-response sets stay the same. For game transformations that operate player-wise and strategy-wise, we prove that (i) implies (ii) and that transformations with property (ii) must be positive affine. The resulting equivalence chain highlights the special status of positive affine transformations among all the transformation procedures that preserve key game-theoretic characteristics.
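In their standard form, the positive affine transformations in question act, for each player i, as

u_i'(s_i, s_{-i}) = α_i · u_i(s_i, s_{-i}) + β_i(s_{-i}), with α_i > 0,

where the offset β_i may depend on the opponents' strategy profile. Such maps change neither player i's best-response sets nor the Nash equilibrium set; the result above shows they are essentially the only player-wise, strategy-wise transformations with these properties.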
List of keywords
Game Theory and Economic Paradigms -> GTEP: Noncooperative games
Game Theory and Economic Paradigms -> General
1944
Eliminating the Cross-Domain Misalignment in Text-guided Image Inpainting
Muqi Huang, Chaoyue Wang, Yong Luo, Lefei Zhang
12 min. talk | August 8th at 15:00 | Session: CV: Image and video synthesis and generation (2/2)
Text-guided image inpainting has rapidly garnered prominence as a task in user-directed image synthesis, aiming to complete the occluded image regions following the textual prompt provided. However, current methods usually grapple with issues arising from the disparity between low-level pixel data and high-level semantic descriptions, which results in inpainted sections not harmonizing with the original image (either structurally or texturally). In this study, we introduce a Structure-Aware Inpainting Learning scheme and an Asymmetric Cross Domain Attention to address these cross-domain misalignment challenges. The proposed structure-aware learning scheme employs features of an intermediate modality as structure guidance to bridge the gap between text information and low-level pixels. Meanwhile, asymmetric cross-domain attention enhances the texture consistency between inpainted and unmasked regions. Our experiments show exceptional performance on leading datasets such as MS-COCO and Open Images, surpassing state-of-the-art text-guided image inpainting methods. Code is released at: https://github.com/MucciH/ECDM-inpainting
List of keywords
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Vision, language and reasoning
1946
Fraud Risk Mitigation in Real-Time Payments: A Strategic Agent-Based Analysis
Katherine Mayo, Nicholas Grabill, Michael P. Wellman
12 min. talk | August 9th at 10:00 | Session: MAS: Applications
Whereas standard financial mechanisms for payment may take days to finalize, real-time payments (RTPs) provide immediate processing and final receipt of funds. The speed of settlement benefits customers, but raises vulnerability to fraud. We seek to understand how bank nodes may strategically mitigate fraud risk in RTPs, through investment in fraud detection and restricting payments eligible for real-time processing. To study this, we introduce an agent-based model of the payment network supporting both real-time and standard payments, and define a game among banks and fraudsters. Using empirical game-theoretic analysis, we identify Nash equilibria in nine game configurations defined by network attributes. Our analysis finds that as banks become more liable for fraud, they continue to allow RTPs but are more likely to employ both restrictions and a high level of fraud detection. Fraudsters, in response, switch from targeting only RTPs to attempting fraud with any type of payment and tend to exploit banks where they have historically been most successful. We also conduct a strategic feature gains assessment to further understand the benefit offered by each of the bank’s risk mitigation measures, which confirms the importance of selective RTP restrictions. Finally, we find that in equilibrium bank strategic decisions negatively affect fraudsters while minimally impacting customers.
List of keywords
Agent-based and Multi-agent Systems -> MAS: Applications
Multidisciplinary Topics and Applications -> MTA: Economics
Multidisciplinary Topics and Applications -> MTA: Finance
1949
Concept-Level Causal Explanation Method for Brain Function Network Classification
Jinduo Liu, Feipeng Wang, Junzhong Ji
6 min. talk | August 7th at 15:00 | Session: HAI: Humans and AI
Using deep models to classify brain functional networks (BFNs) for the auxiliary diagnosis and treatment of brain diseases has become increasingly popular. However, the unexplainability of deep models has seriously hindered their applications in computer-aided diagnosis. In addition, current explanation methods mostly focus on natural images, which cannot be directly used to explain the deep model for BFN classification. In this paper, we propose a concept-level causal explanation method for BFN classification called CLCEM. First, CLCEM employs the causal learning method to extract concepts that are meaningful to humans from BFNs. Second, it aggregates the same concepts to obtain the contribution of each concept to the model output. Finally, CLCEM adds the contribution of each concept to make a diagnosis. The experimental results show that our CLCEM can not only accurately identify brain regions related to specific brain diseases but also make decisions based on the concepts of these brain regions, which enables humans to understand the decision-making process without performance degradation.
List of keywords
Humans and AI -> HAI: Brain sciences
AI Ethics, Trust, Fairness -> ETF: Explainability and interpretability
Data Mining -> DM: Networks
Machine Learning -> ML: Causality
1951
Domain-Hierarchy Adaptation via Chain of Iterative Reasoning for Few-shot Hierarchical Text Classification
Ke Ji, Peng Wang, Wenjun Ke, Guozheng Li, Jiajun Liu, Jingsheng Gao, Ziyu Shang
6 min. talk | August 8th at 11:30 | Session: NLP: Applications
Recently, various pre-trained language models (PLMs) have been proposed and have proven their impressive performance on a wide range of few-shot tasks. However, limited by the unstructured prior knowledge in PLMs, it is difficult to maintain consistent performance on complex hierarchically dependent tasks, especially when the downstream data are extremely scarce. The main challenge is how to transfer the unstructured semantic space in PLMs to the downstream domain hierarchy. Unlike previous work on hierarchical text classification (HTC), which directly performs multi-label classification or uses graph neural networks (GNNs) to inject the label hierarchy, in this work we study the HTC problem under a few-shot setting to adapt knowledge in PLMs from an unstructured manner to the downstream hierarchy. Technically, we design a simple yet effective method named Hierarchical Iterative Conditional Random Field (HierICRF), which searches the most domain-challenging directions and casts domain-hierarchy adaptation as a hierarchical iterative language modeling problem; it then encourages the model to perform hierarchical consistency self-correction during inference, thereby achieving knowledge transfer with hierarchical consistency preservation. We apply HierICRF to various architectures, and extensive experiments on two popular HTC datasets demonstrate that prompting with HierICRF significantly boosts few-shot HTC performance, with average Micro-F1 improvements ranging from 28.80% to 1.50% and Macro-F1 improvements from 36.29% to 1.50% over the previous state-of-the-art (SOTA) baselines as supervision grows from 1-shot to 16-shot, while retaining SOTA hierarchical consistency performance.
List of keywords
Natural Language Processing -> NLP: Applications
Natural Language Processing -> NLP: Text classification
1958
Feedback-Based Adaptive Crossover-Rate in Evolutionary Computation
Xiaoyuan Guan, Tianyi Yang, Chunliang Zhao, Yuren Zhou
6 min. talk | August 8th at 11:30 | Session: S: Evolutionary computation
We propose a novel approach to improve multi-objective evolutionary algorithms by modifying crossover operations. Our approach uses a modifiable cross distribution and virtual point to rebalance the probability distribution of all crossover options. This design reduces runtime for typical pseudo-Boolean functions. Experiments and analysis show our approach effectively optimizes bi-objective problems COCZ and LOTZ in Θ(n) time during crossover, outperforming conventional crossover multi-objective evolutionary algorithms (C-MOEA) which require O(n log n) steps. For the tri-objective problem Hierarchical-COCZ, our approach guarantees an expected runtime of Θ(n^2 log n), while C-MOEA needs at least Ω(n^2 log n) and at most O(n^2 log^2 n) steps.
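For reference, the two bi-objective benchmark functions named above in their standard textbook form (x is a 0/1 list; COCZ assumes an even length):

def lotz(x):
    # LOTZ: (number of leading ones, number of trailing zeros).
    n = len(x)
    leading_ones = next((i for i, b in enumerate(x) if b == 0), n)
    trailing_zeros = next((i for i, b in enumerate(reversed(x)) if b == 1), n)
    return leading_ones, trailing_zeros

def cocz(x):
    # COCZ: (ones in x, ones in the first half plus zeros in the second half).
    half = len(x) // 2
    return sum(x), sum(x[:half]) + sum(1 - b for b in x[half:])

print(lotz([1, 1, 0, 1, 0, 0]))  # (2, 2)
print(cocz([1, 1, 0, 1, 0, 0]))  # (3, 4)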
List of keywords
Search -> S: Evolutionary computation
Machine Learning -> ML: Evolutionary learning
1959
Cross-View Diversity Embedded Consensus Learning for Multi-View Clustering
Chong Peng, Kai Zhang, Yongyong Chen, Chenglizhao Chen, Qiang Cheng
6 min. talk | August 6th at 15:00 | Session: ML: Multi-view learning
Multi-view clustering (MVC) has garnered significant attention in recent studies. In this paper, we propose a novel MVC method, named CCL-MVC. The method constructs a cross-order neighbor tensor of multi-view data to recover a low-rank essential tensor, preserving noise-free, comprehensive, and complementary cross-order relationships among the samples. Furthermore, it constructs a consensus representation matrix by fusing the low-rank essential tensor with auto-adjusted cross-view diversity embedding, fully exploiting both consensus and discriminative information of the data. An effective optimization algorithm is developed, which is theoretically guaranteed to converge. Extensive experimental results confirm the effectiveness of the proposed method.
List of keywords
Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering
1967
On the Logic of Theory Change: Iteration of KM-Update, Revised
Liangda Fang, Tong Zhu, Quanlong Guan, Junming Qiu, Zhao-Rong Lai, Weiqi Luo, Hai Wan
6 min. talk | August 8th at 15:00 | Session: KRR: Knowledge Representation and Reasoning (2/2)
Belief revision and update, two significant types of belief change, both focus on how an agent modifies her beliefs in presence of new information. The most striking difference between them is that the former studies the change of beliefs in a static world while the latter concentrates on a dynamically-changing world. The famous AGM and KM postulates were proposed to capture rational belief revision and update, respectively. However, both of them are too permissive to exclude some unreasonable changes in the iteration. In response to this weakness, the DP postulates and its extensions for iterated belief revision were presented. Furthermore, Ferme and Goncalves integrated these postulates in belief update. Unfortunately, some redundant components are included in the definitions of belief states and the faithful assignments for semantic characterizations. Moreover, their approach does not meet the desired property of iterated belief update. They also do not discuss the rationale of any DP postulate within the update context. This paper is intended to fix these deficiencies of Ferme and Goncalves’s approach. Firstly, we present a modification of the original KM postulates based on belief states, and propose the notion of faithful collective assignments of belief states to partial preorders. Subsequently, we migrate several well-known postulates for iterated belief revision to iterated belief update. Moreover, we provide the exact semantic characterizations based on partial preorders for each of the proposed postulates. Finally, we analyze the compatibility between the above iterated postulates and the KM postulates for belief update.
List of keywords
Knowledge Representation and Reasoning -> KRR: Belief change
Knowledge Representation and Reasoning -> KRR: Reasoning about knowledge and belief
1975
PACIA: Parameter-Efficient Adapter for Few-Shot Molecular Property Prediction
Shiguang Wu, Yaqing Wang, Quanming Yao
6 min. talk | August 6th at 11:30 | Session: ML: Applications
Molecular property prediction (MPP) plays a crucial role in biomedical applications, but it often encounters challenges due to a scarcity of labeled data. Existing works commonly adopt gradient-based strategy to update a large amount of parameters for task-level adaptation. However, the increase of adaptive parameters can lead to overfitting and poor performance. Observing that graph neural network (GNN) performs well as both encoder and predictor, we propose PACIA, a parameter-efficient GNN adapter for few-shot MPP. We design a unified adapter to generate a few adaptive parameters to modulate the message passing process of GNN. We then adopt a hierarchical adaptation mechanism to adapt the encoder at task-level and the predictor at query-level by the unified GNN adapter. Extensive results show that PACIA obtains the state-of-the-art performance in few-shot MPP problems, and our proposed hierarchical adaptation mechanism is rational and effective.
List of keywords
Machine Learning -> ML: Applications
Machine Learning -> ML: Few-shot learning
1982
Instance-Level Metalearning for Outlier Detection
Long Vu, Peter Kirchner, Charu C. Aggarwal, Horst Samulowitz
6 min. talk | August 8th at 11:30 | Session: DM: Anomaly/outlier detection
A machine learning task can be viewed as a sequential pipeline of different algorithmic choices, including data preprocessing, model selection, and hyper-parameter tuning. Automated machine learning selects this sequence in an automated manner. While such approaches are natural in supervised settings, they remain challenging for unsupervised tasks such as outlier detection because of the lack of availability of label-centric feedback. In this paper, we present an instance-level metalearning approach for outlier detection. This approach learns how outlier instances are related to normal points in many labeled data sets to create a supervised meta-model. This meta-model is then used on a new (unlabeled) data set to predict outliers. We show the robustness of our approach on several benchmarks from the OpenML repository.
List of keywords
Data Mining -> DM: Anomaly/outlier detection
Machine Learning -> ML: Automated machine learning
Machine Learning -> ML: Meta-learning
1993
Natural Language-centered Inference Network for Multi-modal Fake News Detection
Qiang Zhang, Jiawei Liu, Fanrui Zhang, Jingyi Xie, Zheng-Jun Zha
6 min. talk | August 7th at 15:00 | Session: DM: Data Mining (1/2)
The proliferation of fake news combining images and text on the internet has triggered widespread concern. Existing research has made important contributions to cross-modal information interaction and fusion, but fails to fundamentally address the modality gap among news image, text, and news-related external knowledge representations. In this paper, we propose a novel Natural Language-centered Inference Network (NLIN) for multi-modal fake news detection by aligning multi-modal news content with the natural language space and introducing an encoder-decoder architecture to fully comprehend the news in context. Specifically, we first unify multi-modal news content into the textual modality by converting news images and news-related external knowledge into plain textual content. Then, we design a multi-modal feature reasoning module, which consists of a multi-modal encoder, a unified-modal context encoder and an inference decoder with a prompt phrase. This framework not only fully extracts the latent representation of cross-modal news content, but also utilizes the prompt phrase to stimulate the powerful in-context learning ability of the pre-trained large language model to reason about the truthfulness of the news content. In addition, to support research in the field of multi-modal fake news detection, we produce a challenging large-scale, multi-platform, multi-domain, multi-modal Chinese Fake News Detection (CFND) dataset. Extensive experiments show that our CFND dataset is challenging and that the proposed NLIN outperforms state-of-the-art methods.
List of keywords
Data Mining -> DM: Mining text, web, social media
Multidisciplinary Topics and Applications -> MTA: News and media
1997
Kernel Readout for Graph Neural Networks
Jiajun Yu, Zhihao Wu, Jinyu Cai, Adele Lu Jia, Jicong Fan
6 min. talk | August 9th at 11:30 | Session: DM: Mining graphs (3/3)
Graph neural networks (GNNs) for graph classification or representation learning require a pooling operation to convert the nodes’ embeddings of each graph to a vector as the graph-level representation and the operation has a significant impact on model accuracy. The paper presents a novel graph pooling method called Kernel Readout (KerRead). KerRead maps the node embeddings from the sample space with limited nodes to an augmented sample space with infinite nodes, and then calculates the inner product between some learnable adaptive centers and the augmented node embeddings, which forms a final graph-level feature vector. We apply the proposed strategy to six supervised and two unsupervised graph neural networks such as GCN, GIN, GUNet, InfoGraph, and GraphCL, and the experiments on eight benchmark datasets show that the proposed readout outperforms classical pooling methods such as Sum and seven state-of-the-art pooling methods such as SRead and Janossy GRU. Code and Appendix are both available at https://github.com/jiajunCAU/KerRead.
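A minimal sketch of a kernel-style readout follows, assuming an RBF kernel (whose feature space is infinite-dimensional) and randomly initialized stand-ins for the learnable centers; KerRead's exact kernel and training procedure may differ.

import numpy as np

def kernel_readout(h, centers, gamma=0.5):
    # h: (n_nodes, d) node embeddings; centers: (n_centers, d).
    sq = ((h[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # pairwise distances
    k = np.exp(-gamma * sq)                                    # RBF kernel values
    return k.sum(axis=0)                                       # fixed-size graph vector

rng = np.random.default_rng(0)
print(kernel_readout(rng.normal(size=(7, 16)), rng.normal(size=(4, 16))).shape)  # (4,)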
List of keywords
Data Mining -> DM: Mining graphs
Machine Learning -> ML: Representation learning
2005
When Fairness Meets Privacy: Exploring Privacy Threats in Fair Binary Classifiers via Membership Inference Attacks
Huan Tian, Guangsheng Zhang, Bo Liu, Tianqing Zhu, Ming Ding, Wanlei Zhou
6 min. talk | August 8th at 11:30 | Session: ETF: AI Ethics, Trust, Fairness (2/2)
While in-processing fairness approaches show promise in mitigating biased predictions, their potential impact on privacy leakage remains under-explored. We aim to address this gap by assessing the privacy risks of fairness-enhanced binary classifiers with membership inference attacks (MIAs). Surprisingly, our results reveal that these fairness interventions exhibit increased resilience against existing attacks, indicating that enhancing fairness does not necessarily lead to privacy compromises. However, we find that current attack methods are ineffective, as they typically degrade into simple threshold models with limited attack effectiveness. Following this observation, we discover a novel threat dubbed Fairness Discrepancy Membership Inference Attacks (FD-MIA) that exploits prediction discrepancies between fair and biased models. This attack reveals more potent vulnerabilities and poses significant privacy risks to model privacy. Extensive experiments across multiple datasets, attack methods, and representative fairness approaches confirm our findings and demonstrate the efficacy of the proposed attack method. Our study exposes the overlooked privacy threats in fairness studies, advocating for thorough evaluations of potential security vulnerabilities before model deployment.
List of keywords
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
AI Ethics, Trust, Fairness -> ETF: Fairness and diversity
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
2006
Ten Words Only Still Help: Improving Black-Box AI-Generated Text Detection via Proxy-Guided Efficient Re-Sampling
Yuhui Shi, Qiang Sheng, Juan Cao, Hao Mi, Beizhe Hu, Danding Wang
6 min. talk | August 9th at 10:00 | Session: ETF: Trustworthy AI
With the rapidly increasing application of large language models (LLMs), their abuse has caused many undesirable societal problems such as fake news, academic dishonesty, and information pollution. This makes AI-generated text (AIGT) detection of great importance. Among existing methods, white-box methods are generally superior to black-box methods in terms of performance and generalizability, but they require access to LLMs’ internal states and are not applicable to black-box settings. In this paper, we propose to estimate word generation probabilities as pseudo white-box features via multiple re-sampling to help improve AIGT detection under the black-box setting. Specifically, we design POGER, a proxy-guided efficient re-sampling method, which selects a small subset of representative words (e.g., 10 words) for performing multiple re-sampling in black-box AIGT detection. Experiments on datasets containing texts from humans and seven LLMs show that POGER outperforms all baselines in macro F1 under black-box, partial white-box, and out-of-distribution settings and maintains lower re-sampling costs than its existing counterparts.
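The pseudo white-box feature can be pictured as follows, where generate is a hypothetical black-box call that samples one next token for a given context; the word's generation probability is estimated by its empirical sampling frequency.

def estimate_word_prob(generate, context, target_word, n_samples=30):
    # Monte Carlo estimate of P(target_word | context) for a black-box LLM
    # via repeated re-sampling (simplified sketch of the idea).
    hits = sum(generate(context) == target_word for _ in range(n_samples))
    return hits / n_samples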
List of keywords
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
Natural Language Processing -> NLP: Applications
2007
SDformer: Transformer with Spectral Filter and Dynamic Attention for Multivariate Time Series Long-term Forecasting
Ziyu Zhou, Gengyu Lyu, Yiming Huang, Zihao Wang, Ziyu Jia, Zhen Yang
12 min. talk | August 8th at 15:00 | Session: ML: Time series and data streams
Transformer has gained widespread adoption in modeling time series due to the exceptional ability of its self-attention mechanism in capturing long-range dependencies. However, when processing time series data with numerous variates, the vanilla self-attention mechanism tends to distribute attention weights evenly and smoothly, causing row-homogenization in attention maps and further hampering time series forecasting. To tackle this issue, we propose an advanced Transformer architecture entitled SDformer, which introduces two novel modules, Spectral-Filter-Transform (SFT) and Dynamic-Directional-Attention (DDA), and integrates them into the encoder of Transformer to achieve more intensive attention allocation. Specifically, the SFT module utilizes the Fast Fourier Transform to select the most prominent frequencies, along with a Hamming Window to smooth and denoise the filtered series data; the DDA module applies a specialized kernel function to the query and key vectors projected from the denoised data, concentrating this innovative attention mechanism more effectively on the most informative variates to obtain a sharper attention distribution. These two modules jointly enable attention weights to be more salient among numerous variates, which in turn enhances the attention’s ability to capture multivariate correlations, improving forecasting performance. Extensive experiments on public datasets demonstrate its superior performance over other state-of-the-art models. Code is available at https://github.com/zhouziyu02/SDformer.
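The SFT step might be sketched roughly as below, assuming the prominent frequencies are selected by magnitude and the Hamming window is applied to the filtered series; details such as the choice of k are assumptions, not the paper's exact design.

import torch

def spectral_filter(x, k=8):
    # x: (batch, length) series. Keep the k most prominent frequencies,
    # then smooth the filtered series with a Hamming window.
    spec = torch.fft.rfft(x, dim=-1)
    topk = spec.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(spec.abs()).scatter_(-1, topk, 1.0)
    filtered = torch.fft.irfft(spec * mask, n=x.size(-1), dim=-1)
    return filtered * torch.hamming_window(x.size(-1), device=x.device)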
List of keywords
Machine Learning -> ML: Time series and data streams
Data Mining -> DM: Mining spatial and/or temporal data
Machine Learning -> ML: Attention models
2024
VCC-INFUSE: Towards Accurate and Efficient Selection of Unlabeled Examples in Semi-supervised Learning
Shijie Fang, Qianhan Feng, Tong Lin
6 min. talk | August 8th at 11:30 | Session: ML: Semi-supervised learning
Despite the progress of Semi-supervised Learning (SSL), existing methods fail to utilize unlabeled data effectively and efficiently. Many pseudo-label-based methods select unlabeled examples based on inaccurate confidence scores from the classifier. Most prior work also uses all available unlabeled data without pruning, making it difficult to handle large amounts of unlabeled data. To address these issues, we propose two methods: Variational Confidence Calibration (VCC) and Influence-Function-based Unlabeled Sample Elimination (INFUSE). VCC is a universal plugin for SSL confidence calibration, using a variational autoencoder to select more accurate pseudo labels based on three types of consistency scores. INFUSE is a data pruning method that constructs a core dataset of unlabeled examples under SSL. Our methods are effective across multiple datasets and settings, reducing classification error rates and saving training time. Together, VCC-INFUSE reduces the error rate of FlexMatch on the CIFAR-100 dataset by 1.08% while saving nearly half of the training time.
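The INFUSE side can be pictured as a plain ranking-and-truncation step, assuming a per-example influence score has already been computed; the helper below is an illustrative sketch, not the paper's implementation.

def prune_unlabeled(examples, influence, keep_ratio=0.5):
    # Keep the most influential fraction of the unlabeled pool as the core set.
    ranked = sorted(zip(influence, range(len(examples))), key=lambda t: -t[0])
    k = int(len(ranked) * keep_ratio)
    return [examples[i] for _, i in ranked[:k]]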
List of keywords
Machine Learning -> ML: Semi-supervised learning
2026
Balancing Multimodal Learning via Online Logit Modulation
Daoming Zong, Chaoyue Ding, Baoxiang Li, Jiakui Li, Ken Zheng
6 min. talk | August 9th at 10:00 | Session: ML: Optimization
Multimodal learning is provably superior to unimodal learning. However, in practice, the best-performing unimodal networks often outperform jointly trained multimodal networks. This phenomenon can be attributed to the varying convergence and generalization rates across different modalities, leading to the dominance of one modality and causing underfitting of other modalities in simple multimodal joint training. To mitigate this issue, we propose two key ingredients: i) disentangling the learning of unimodal features and multimodal interaction through an intermediate representation fusion block; ii) modulating the logits of different modalities via dynamic coefficients during training to align their magnitudes with the target values, referred to as online logit modulation (OLM). Remarkably, OLM is model-agnostic and can be seamlessly integrated with most existing multimodal training frameworks. Empirical evidence shows that our approach brings significant enhancements over baselines on a wide range of multimodal tasks, covering video, audio, text, image, and depth modalities.
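A rough sketch of the modulation idea, under the assumption that each modality's logits are rescaled to a shared target magnitude before fusion; the paper's exact dynamic-coefficient update may differ.

import torch

def modulate_logits(logits_per_mod, target_norm=1.0):
    # Rescale each modality's logits so their magnitudes match a common
    # target, preventing one modality from dominating joint training.
    fused = 0.0
    for z in logits_per_mod:
        coeff = target_norm / (z.norm(dim=-1, keepdim=True) + 1e-8)
        fused = fused + coeff * z
    return fused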
List of keywords
Machine Learning -> ML: Optimization
Computer Vision -> CV: Multimodal learning
Machine Learning -> ML: Applications
Machine Learning -> ML: Attention models
2029
PTDE: Personalized Training with Distilled Execution for Multi-Agent Reinforcement Learning
Yiqun Chen, Hangyu Mao, Jiaxin Mao, Shiguang Wu, Tianle Zhang, Bin Zhang, Wei Yang, Hongxing Chang
6 min. talk | August 7th at 15:00 | Session: MAS: Multi-agent learning
Centralized Training with Decentralized Execution (CTDE) has emerged as a widely adopted paradigm in multi-agent reinforcement learning, emphasizing the utilization of global information for learning an enhanced joint Q-function or centralized critic. In contrast, our investigation delves into harnessing global information to directly enhance individual Q-functions or individual actors. Notably, we discover that applying identical global information universally across all agents proves insufficient for optimal performance. Consequently, we advocate for the customization of global information tailored to each agent, creating agent-personalized global information to bolster overall performance. Furthermore, we introduce a novel paradigm named Personalized Training with Distilled Execution (PTDE), wherein agent-personalized global information is distilled into the agent’s local information. This distilled information is then utilized during decentralized execution, resulting in minimal performance degradation. PTDE can be seamlessly integrated with state-of-the-art algorithms, leading to notable performance enhancements across diverse benchmarks, including the SMAC benchmark, Google Research Football (GRF) benchmark, and Learning to Rank (LTR) task.
List of keywords
Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
Agent-based and Multi-agent Systems -> MAS: Coordination and cooperation
2034
HeterGCL: Graph Contrastive Learning Framework on Heterophilic Graph
Chenhao Wang, Yong Liu, Yan Yang, Wei Li
6 min. talk | August 6th at 15:00 | Session: DM: Mining graphs (2/3)
Graph Contrastive Learning (GCL) has attracted significant research attention due to its self-supervised ability to learn robust node representations. Unfortunately, most methods primarily focus on homophilic graphs, rendering them less effective for heterophilic graphs. In addition, the complexity of node interactions in heterophilic graphs poses considerable challenges to augmentation schemes, encoding architectures, and contrastive designs for traditional GCL. In this work, we propose HeterGCL, a novel graph contrastive learning framework with structural and semantic learning to explore the true potential of GCL on heterophilic graphs. Specifically, we abandon random augmentation schemes that destroy the graph structure and instead introduce an adaptive neighbor aggregation strategy (ANA) to extract topology-supervised signals from neighboring nodes at different distances and explore the structural information with an adaptive local-to-global contrastive loss. In the semantic learning module, we jointly consider the original node features and the similarity between nodes in the latent feature space to explore hidden associations between nodes. Experimental results on homophilic and heterophilic graphs demonstrate that HeterGCL outperforms existing self-supervised and semi-supervised baselines across various downstream tasks.
List of keywords
Data Mining -> DM: Mining graphs
Machine Learning -> ML: Self-supervised Learning
2043
Recall, Retrieve and Reason: Towards Better In-Context Relation Extraction
Guozheng Li, Peng Wang, Wenjun Ke, Yikai Guo, Ke Ji, Ziyu Shang, Jiajun Liu, Zijie Xu
12 min. talk | August 7th at 11:30 | Session: NLP: Information extraction
Relation extraction (RE) aims to identify relations between entities mentioned in texts. Although large language models (LLMs) have demonstrated impressive in-context learning (ICL) abilities in various tasks, they still suffer from poor performance compared to most supervised fine-tuned RE methods. Utilizing ICL for RE with LLMs encounters two challenges: (1) retrieving good demonstrations from training examples, and (2) enabling LLMs to exhibit strong ICL abilities in RE. On the one hand, retrieving good demonstrations is a non-trivial process in RE, which easily results in low relevance regarding entities and relations. On the other hand, ICL with an LLM achieves poor performance in RE when RE differs from language modeling in nature or the LLM is not large enough. In this work, we propose a novel recall-retrieve-reason RE framework that synergizes LLMs with retrieval corpora (training examples) to enable relevant retrieving and reliable in-context reasoning. Specifically, we distill consistent ontological knowledge from training datasets to let LLMs generate relevant entity pairs grounded by retrieval corpora as valid queries. These entity pairs are then used to retrieve relevant training examples from the retrieval corpora as demonstrations for LLMs to conduct better ICL via instruction tuning. Extensive experiments on different LLMs and RE datasets demonstrate that our method generates relevant and valid entity pairs and boosts the ICL abilities of LLMs, achieving competitive or new state-of-the-art performance on sentence-level RE compared to previous supervised fine-tuning methods and ICL-based methods.
List of keywords
Natural Language Processing -> NLP: Information extraction
2045
ReliaAvatar: A Robust Real-Time Avatar Animator with Integrated Motion Prediction
Bo Qian, Zhenhuan Wei, Jiashuo Li, Xing Wei
6 min. talk | August 7th at 15:00 | Session: HAI: Humans and AI
Efficiently estimating the full-body pose with minimal wearable devices presents a worthwhile research direction. Despite significant advancements in this field, most current research neglects to explore full-body avatar estimation under low-quality signal conditions, which is prevalent in practical usage. To bridge this gap, we summarize three scenarios that may be encountered in real-world applications: standard scenario, instantaneous data-loss scenario, and prolonged data-loss scenario, and propose a new evaluation benchmark. The solution we propose to address data-loss scenarios is integrating the full-body avatar pose estimation problem with motion prediction. Specifically, we present ReliaAvatar, a real-time, reliable avatar animator equipped with predictive modeling capabilities employing a dual-path architecture. ReliaAvatar operates effectively, with an impressive performance rate of 109 frames per second (fps). Extensive comparative evaluations on widely recognized benchmark datasets demonstrate ReliaAvatar’s superior performance in both standard and low data-quality conditions. The code is available at https://github.com/MIV-XJTU/ReliaAvatar.
List of keywords
Humans and AI -> HAI: Applications
Humans and AI -> HAI: Human-computer interaction
Humans and AI -> HAI: Personalization and user modeling
Robotics -> ROB: Human robot interaction
2051
Allocating Mixed Goods with Customized Fairness and Indivisibility Ratio
Bo Li, Zihao Li, Shengxin Liu, Zekai Wu
6 min. talk | August 8th at 10:00 | Session: GTEP: Fair division
We consider the problem of fairly allocating a combination of divisible and indivisible goods. While fairness criteria like envy-freeness (EF) and proportionality (PROP) can always be achieved for divisible goods, only their relaxed versions, such as the “up to one” relaxations EF1 and PROP1, can be satisfied when the goods are indivisible. The “up to one” relaxations require the fairness conditions to be satisfied provided that one good can be completely eliminated or added in the comparison. In this work, we bridge the gap between the two extremes and propose “up to a fraction” relaxations for the allocation of mixed divisible and indivisible goods. The fraction is determined based on the proportion of indivisible goods, which we call the indivisibility ratio. The new concepts also introduce asymmetric conditions that are customized for individuals with varying indivisibility ratios. We provide both upper and lower bounds on the fractions of the modified item in order to satisfy the fairness criterion. Our results are tight up to a constant for EF and asymptotically tight for PROP.
List of keywords
Game Theory and Economic Paradigms -> GTEP: Fair division
2053
Imperio: Language-Guided Backdoor Attacks for Arbitrary Model Control
Ka-Ho Chow, Wenqi Wei, Lei Yu
6 min. talk | August 7th at 11:30 | Session: CV: Adversarial learning, adversarial attack and defense methods
Natural language processing (NLP) has received unprecedented attention. While advancements in NLP models have led to extensive research into their backdoor vulnerabilities, the potential for these advancements to introduce new backdoor threats remains unexplored. This paper proposes Imperio, which harnesses the language understanding capabilities of NLP models to enrich backdoor attacks. Imperio provides a new model control experience. Demonstrated through controlling image classifiers, it empowers the adversary to manipulate the victim model with arbitrary output through language-guided instructions. This is achieved using a language model to fuel a conditional trigger generator, with optimizations designed to extend its language understanding capabilities to backdoor instruction interpretation and execution. Our experiments across three datasets, five attacks, and nine defenses confirm Imperio’s effectiveness. It can produce contextually adaptive triggers from text descriptions and control the victim model with desired outputs, even in scenarios not encountered during training. The attack reaches a high success rate without compromising the accuracy of clean inputs and exhibits resilience against representative defenses. Supplementary materials are available at https://khchow.com/Imperio.
List of keywords
Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
2055
Cross-View Contrastive Fusion for Enhanced Molecular Property Prediction
Yan Zheng, Song Wu, Junyu Lin, Yazhou Ren, Jing He, Xiaorong Pu, Lifang He
6 min. talk | August 6th at 15:00 | Session: ML: Multi-view learning
Machine learning based molecular property prediction (MPP) has been a hot topic in the field of computer aided drug discovery (CADD). However, current MPP methods face two prominent challenges: 1) single-view MPP methods do not sufficiently exploit the complementary information of molecular data across multiple views, generally producing suboptimal performance, and 2) most existing multi-view MPP methods ignore the disparities in data quality among different views, inadvertently introducing the risk of models being overshadowed by inferior views. To address the above challenges, we introduce a novel cross-view contrastive fusion method for enhanced molecular property prediction (MolFuse). First, we extract intricate molecular semantics and structures from both sequence and graph views to leverage the complementarity of multi-view data. Then, MolFuse employs two distinct graphs, the atomic graph and chemical bond graph, to enhance the representation of the molecular graph, allowing us to integrate both the fundamental backbone attributes and the nuanced shape characteristics. Notably, we incorporate a dual learning mechanism to refine the initial feature representations, and global features are obtained by maximizing the coherence among diverse view-specific molecular representations for the downstream task. The overall learning processes are combined into a unified optimization problem for iterative training. Experiments on multiple benchmark datasets demonstrate the superiority of our MolFuse.
List of keywords
Machine Learning -> ML: Multi-view learning
Data Mining -> DM: Mining graphs
Multidisciplinary Topics and Applications -> MTA: Bioinformatics
2072
Langshaw: Declarative Interaction Protocols Based on Sayso and Conflict
Munindar P. Singh, Samuel H. Christie V., Amit K. Chopra
6 min. talk | August 9th at 11:30 | Session: MAS: Agent-based and Multi-agent Systems (2/2)
Current languages for specifying multiagent protocols either over-constrain protocol enactments or complicate capturing their meanings. We propose Langshaw, a declarative protocol language based on (1) sayso, a new construct that captures who has priority over setting each attribute, and (2) nono and nogo, two constructs to capture conflicts between actions. Langshaw combines flexibility with an information model to express meaning. We give a formal semantics for Langshaw, procedures for determining the safety and liveness of a protocol, and a method to generate a message-oriented protocol (embedding needed coordination) suitable for flexible asynchronous enactment.
List of keywords
Agent-based and Multi-agent Systems -> MAS: Agent communication
Agent-based and Multi-agent Systems -> MAS: Engineering methods, platforms, languages and tools
Agent-based and Multi-agent Systems -> MAS: Formal verification, validation and synthesis
Agent-based and Multi-agent Systems -> General
2089
Vulnerabilities of Single-Round Incentive Compatibility in Auto-bidding: Theory and Evidence from ROI-Constrained Online Advertising Markets
Juncheng Li, Pingzhong Tang
6 min. talk | August 8th at 15:00 | Session: GTEP: Game Theory and Economic Paradigms
Most of the work in the auction design literature assumes that bidders behave rationally based on the information available for every individual auction, and the revelation principle enables designers to restrict their efforts to incentive compatible (IC) mechanisms. However, in today’s online advertising markets, one of the most important real-life applications of auction design, the data and computational power required to bid optimally are only available to the platform, and an advertiser can only participate by setting performance objectives and constraints for its proxy auto-bidder provided by the platform. The prevalence of auto-bidding necessitates a review of auction theory. In this paper, we examine the markets through the lens of ROI-constrained value-maximizing campaigns. We show that second price auction exhibits many undesirable properties (computational hardness, non-monotonicity, instability of bidders’ utilities, and interference in A/B testing) and loses its dominant theoretical advantages in single-item scenarios. In addition, we make it clear how IC and its runner-up-winner interdependence contribute to each property. We hope that our work could bring new perspectives to the community and help practitioners attain a better grasp of real-world markets.
List of keywords
Game Theory and Economic Paradigms -> GTEP: Auctions and market-based systems
Multidisciplinary Topics and Applications -> MTA: Economics
2093
Large Language Model-Enhanced Algorithm Selection: Towards Comprehensive Algorithm Representation
Xingyu Wu, Yan Zhong, Jibin Wu, Bingbing Jiang, Kay Chen Tan
12 min. talk | August 6th at 11:30 | Session: ML: Applications
Algorithm selection, a critical process of automated machine learning, aims to identify the most suitable algorithm for solving a specific problem prior to execution. Mainstream algorithm selection techniques heavily rely on problem features, while the role of algorithm features remains largely unexplored. Due to the intrinsic complexity of algorithms, effective methods for universally extracting algorithm information are lacking. This paper takes a significant step towards bridging this gap by introducing Large Language Models (LLMs) into algorithm selection for the first time. By comprehending the code text, the LLM not only captures the structural and semantic aspects of the algorithm, but also demonstrates contextual awareness and an understanding of library functions. The high-dimensional algorithm representation extracted by the LLM, after undergoing a feature selection module, is combined with the problem representation and passed to the similarity calculation module. The selected algorithm is determined by the matching degree between a given problem and different algorithms. Extensive experiments validate the performance superiority of the proposed model and the efficacy of each key module. Furthermore, we present a theoretical upper bound on model complexity, showcasing the influence of the algorithm representation and feature selection modules. This provides valuable theoretical guidance for the practical implementation of our method.
List of keywords
Machine Learning -> ML: Automated machine learning
Search -> S: Algorithm portfolios and configuration
Machine Learning -> ML: Applications
Natural Language Processing -> NLP: Language models
2119
Seed Selection in the Heterogeneous Moran Process
Petros Petsinis, Andreas Pavlogiannis, Josef Tkadlec, Panagiotis Karras
6 min. talk | August 7th at 15:00 | Session: DM: Data Mining (1/2)
The Moran process is a classic stochastic process that models the rise and takeover of novel traits in network-structured populations. In biological terms, a set of mutants, each with fitness m ∈ (0, ∞), invades a population of residents with fitness 1. Each agent reproduces at a rate proportional to its fitness and each offspring replaces a random network neighbor. The process ends when the mutants either fixate (take over the whole population) or go extinct. The fixation probability measures the success of the invasion. To account for environmental heterogeneity, we study a generalization of the standard process, called the Heterogeneous Moran process. Here, the fitness of each agent is determined both by its type (resident/mutant) and the node it occupies. We study the natural optimization problem of seed selection: given a budget k, which k agents should initiate the mutant invasion to maximize the fixation probability? We show that the problem is strongly inapproximable: it is NP-hard to distinguish between maximum fixation probability 0 and 1. We then focus on mutant-biased networks, where each node exhibits at least as large mutant fitness as resident fitness. We show that the problem remains NP-hard, but the fixation probability becomes submodular, and thus the optimization problem admits a greedy (1 − 1/e)-approximation. An experimental evaluation of the greedy algorithm along with various heuristics on real-world data sets corroborates our results.
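Since the fixation probability is submodular on mutant-biased networks, the standard greedy algorithm attains the (1 − 1/e)-approximation; a generic sketch follows, where fixation_prob is assumed to be a (Monte Carlo) estimator of the fixation probability of a seed set.

def greedy_seeds(nodes, fixation_prob, k):
    # Classic greedy maximization of a monotone submodular set function.
    S = set()
    for _ in range(k):
        best = max((v for v in nodes if v not in S),
                   key=lambda v: fixation_prob(S | {v}))
        S.add(best)
    return S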
List of keywords
Data Mining -> DM: Networks
Agent-based and Multi-agent Systems -> MAS: Resource allocation
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Search -> S: Evolutionary computation
2126
Score-CDM: Score-Weighted Convolutional Diffusion Model for Multivariate Time Series Imputation
Shunyang Zhang, Senzhang Wang, Hao Miao, Hao Chen, Changjun Fan, Jian Zhang
6 min. talk | August 7th at 10:00 | Session: DM: Mining spatial and/or temporal data (1/2)
Multivariate time series (MTS) data are usually incomplete in real scenarios, and imputing the incomplete MTS is practically important to facilitate various time series mining tasks. Recently, diffusion model-based MTS imputation methods have achieved promising results by utilizing CNN or attention mechanisms for temporal feature learning. However, it is hard to adaptively trade off the diverse effects of local and global temporal features by simply combining CNN and attention. To address this issue, we propose a Score-weighted Convolutional Diffusion Model (Score-CDM for short), whose backbone consists of a Score-weighted Convolution Module (SCM) and an Adaptive Reception Module (ARM). SCM adopts a score map to capture the global temporal features in the time domain, while ARM uses a Spectral2Time Window Block (S2TWB) to convolve the local time series data in the spectral domain. Benefiting from the time convolution properties of Fast Fourier Transformation, ARM can adaptively change the receptive field of the score map, and thus effectively balance the local and global temporal features. We conduct extensive evaluations on three real MTS datasets of different domains, and the results verify the effectiveness of the proposed Score-CDM.
List of keywords
Data Mining -> DM: Mining spatial and/or temporal data
2127
Cooperation and Control in Delegation Games
Oliver Sourbut, Lewis Hammond, Harriet Wood
6 min. talk | August 7th at 10:00 | Session: MAS: Coordination and cooperation
Many settings of interest involving humans and machines – from virtual personal assistants to autonomous vehicles – can naturally be modelled as principals (humans) delegating to agents (machines), which then interact with each other on their principals’ behalf. We refer to these multi-principal, multi-agent scenarios as delegation games. In such games, there are two important failure modes: problems of control (where an agent fails to act in line with their principal’s preferences) and problems of cooperation (where the agents fail to work well together). In this paper we formalise and analyse these problems, further breaking them down into issues of alignment (do the players have similar preferences?) and capabilities (how competent are the players at satisfying those preferences?). We show – theoretically and empirically – how these measures determine the principals’ welfare, how they can be estimated using limited observations, and thus how they might be used to help us design more aligned and cooperative AI systems.
List of keywords
Agent-based and Multi-agent Systems -> MAS: Coordination and cooperation
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
Game Theory and Economic Paradigms -> GTEP: Other
Humans and AI -> HAI: Human-AI collaboration
2128
RealDex: Towards Human-like Grasping for Robotic Dexterous Hand
Yumeng Liu, Yaxun Yang, Youzhuo Wang, Xiaofei Wu, Jiamin Wang, Yichen Yao, Sören Schwertfeger, Sibei Yang, Wenping Wang, Jingyi Yu, Xuming He, Yuexin Ma
6 min. talk | August 6th at 11:30 | Session: ROB: Robotics (1/2)
In this paper, we introduce RealDex, a pioneering dataset capturing authentic dexterous hand grasping motions infused with human behavioral patterns, enriched by multi-view and multimodal visual data. Utilizing a teleoperation system, we seamlessly synchronize human-robot hand poses in real time. This collection of human-like motions is crucial for training dexterous hands to mimic human movements more naturally and precisely. RealDex holds immense promise in advancing humanoid robots toward automated perception, cognition, and manipulation in real-world scenarios. Moreover, we introduce a cutting-edge dexterous grasping motion generation framework, which aligns with human experience and enhances real-world applicability through effectively utilizing Multimodal Large Language Models. Extensive experiments have demonstrated the superior performance of our method on RealDex and other open datasets. The dataset and associated code are available at https://4dvlab.github.io/RealDex_page/.
List of keywords
Robotics -> ROB: Learning in robotics
Robotics -> ROB: Manipulation
Robotics -> ROB: Robotics and vision
2140
Facility Location Problems with Capacity Constraints: Two Facilities and Beyond
Gennaro Auricchio, Zihe Wang, Jie Zhang
6 min. talk | August 8th at 11:30 | Session: GTEP: Mechanism design
In this paper, we investigate the Mechanism Design aspects of the m-Capacitated Facility Location Problem (m-CFLP) on a line. We focus on two frameworks. In the first framework, the number of facilities is arbitrary, all facilities have the same capacity, and the number of agents is equal to the total capacity of all facilities. In the second framework, we aim to place two facilities, each with a capacity of at least half of the total agents. For both of these frameworks, we propose truthful mechanisms with bounded approximation ratios with respect to the Social Cost (SC) and the Maximum Cost (MC). When m>2, the result sharply contrasts with the impossibility results known for the classic m-Facility Location Problem, where capacity constraints are not considered. Furthermore, all our mechanisms are optimal with respect to the MC and optimal or nearly optimal with respect to the SC among anonymous mechanisms. For both frameworks, we provide a lower bound on the approximation ratio that any truthful and deterministic mechanism can achieve with respect to the SC and MC.
List of keywords
Game Theory and Economic Paradigms -> GTEP: Mechanism design
Agent-based and Multi-agent Systems -> MAS: Agent theories and models
Agent-based and Multi-agent Systems -> MAS: Coordination and cooperation
Agent-based and Multi-agent Systems -> MAS: Resource allocation
2144
Robust Reward Placement under Uncertainty
Petros Petsinis, Kaichen Zhang, Andreas Pavlogiannis, Jingbo Zhou, Panagiotis Karras
6 min. talk | August 9th at 11:30 | Session: PS: Planning and Scheduling (2/2)
We consider a problem of placing generators of rewards to be collected by randomly moving agents in a network. In many settings, the precise mobility pattern may be one of several possible, based on parameters outside our control, such as weather conditions. The placement should be robust to this uncertainty, guaranteeing a competitive total reward across possible networks. To study such scenarios, we introduce the Robust Reward Placement problem (RRP). Agents move randomly by a Markovian Mobility Model with a predetermined set of locations whose connectivity is chosen adversarially from a known set Π of candidates. We aim to select a set of reward states within a budget that maximizes the minimum ratio, among all candidates in Π, of the collected total reward over the optimal collectable reward under the same candidate. We prove that RRP is NP-hard and inapproximable, and develop Ψ-Saturate, a pseudo-polynomial time algorithm that achieves an ϵ-additive approximation by exceeding the budget constraint by a factor that scales as O(ln |Π|/ϵ). In addition, we present several heuristics, most prominently one inspired by a dynamic programming algorithm for the max–min 0–1 KNAPSACK problem. We corroborate our theoretical analysis with an experimental evaluation on synthetic and real data.
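The max-min objective itself is easy to state in code. The sketch below assumes oracles reward (expected reward collected under a candidate network) and opt_reward (the optimum under that candidate); it only illustrates the ratio that Ψ-Saturate approximately maximizes.

def robust_ratio(placement, candidates, reward, opt_reward):
    # Worst ratio, over candidate connectivities, of collected reward
    # to the optimal collectable reward under the same candidate.
    return min(reward(placement, c) / opt_reward(c) for c in candidates)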
List of keywords
Planning and Scheduling -> PS: Planning under uncertainty
Data Mining -> DM: Networks
Planning and Scheduling -> PS: Markov decisions processes
Search -> S: Combinatorial search and optimisation
2165
A Deep Reinforcement Learning Approach to Balance Viewport Prediction and Video Transmission in 360° Video Streaming
Guanghui Zhang, Jing Guo
6 min. talk | August 6th at 15:00 | Session: MTA: Transportation
360° video streaming has seen tremendous growth in recent years. However, our measurement reveals a dilemma that severely limits QoE. On the one hand, viewport prediction requires the shortest possible prediction distance for high prediction accuracy; on the other hand, video transmission requires more buffered data to compensate for bandwidth fluctuations, otherwise substantial playback rebuffering would be incurred. Since no existing method can break this dilemma, QoE optimization has naturally been bottlenecked. This work tackles this challenge by developing QUTA – a novel learning-based streaming system. Specifically, our measurement shows that three kinds of internal streaming parameters have significant impacts on the prediction distance, namely, download pause, data rate threshold, and playback rate. On top of this, we design a new long-term-planning (LTP) learning method that tunes the parameters dynamically based on the network and streaming context. Evaluations with large-scale streaming trace data show that QUTA not only improves the prediction accuracy and QoE by up to 68.4% but also exhibits strong robustness.
List of keywords
Multidisciplinary Topics and Applications -> MTA: Transportation
2184
Comparing Ways of Obtaining Candidate Orderings from Approval Ballots
Théo Delemazure, Chris Dong, Dominik Peters, Magdalena Tydrichova
6 min. talk | August 6th at 11:30 | Session: GTEP: Computational social choice (1/2)
To understand and summarize approval preferences and other binary evaluation data, it is useful to order the items on an axis which explains the data. In a political election using approval voting, this could be an ideological left-right axis such that each voter approves adjacent candidates, an analogue of single-peakedness. In a perfect axis, every approval set would be an interval, which is usually not possible, and so we need to choose an axis that gets closest to this ideal. The literature has developed algorithms for optimizing several objective functions (e.g., minimize the number of added approvals needed to get a perfect axis), but provides little help with choosing among different objectives. In this paper, we take a social choice approach and compare 5 different axis selection rules axiomatically, by studying the properties they satisfy. We establish some impossibility theorems, and characterize (within the class of scoring rules) the rule that chooses the axes that maximize the number of votes that form intervals, using the axioms of ballot monotonicity and resistance to cloning. Finally, we study the behavior of the rules on data from French election surveys, on the votes of justices of the US Supreme Court, and on synthetic data.
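The objective of the characterized rule, the number of ballots forming intervals of a given axis, is straightforward to compute; the sketch below implements exactly that count.

def interval_votes(axis, ballots):
    # axis: ordered list of candidates; ballots: iterable of approval sets.
    pos = {c: i for i, c in enumerate(axis)}
    count = 0
    for ballot in ballots:
        idx = sorted(pos[c] for c in ballot)
        # A ballot is an interval iff its positions form a contiguous block.
        if idx and idx[-1] - idx[0] + 1 == len(idx):
            count += 1
    return count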
List of keywords
Game Theory and Economic Paradigms -> GTEP: Computational social choice
2188
Determining Winners in Elections with Absent Votes
Qishen Han, Amelie Marian, Lirong Xia
6 min. talk | August 6th at 11:30 | Session: GTEP: Computational social choice (1/2)
An important question in elections is determining whether a candidate can be a winner when some votes are absent. We study this determining winners with absent votes (WAV) problem for elections with top-truncated ballots. We show that the WAV problem is NP-complete for single transferable vote, Maximin, and Copeland, and propose a special case of positional scoring rules for which the problem can be computed in polynomial time. Our results for top-truncated rankings differ from those for full rankings: there, the hardness results still hold when the number of candidates or the number of missing votes is bounded, whereas we show that the problem can be solved in polynomial time in either case.
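For intuition, the best-case test is simple under plurality (an illustrative special case, not the paper's general algorithm): award every absent vote to the candidate in question and check whether any rival still beats it.

def can_win_plurality(scores, num_absent, c):
    # scores: current plurality scores per candidate from the cast votes.
    # In the best case for c, all absent votes rank c first.
    best = dict(scores)
    best[c] += num_absent
    return all(best[c] >= s for cand, s in best.items() if cand != c)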
List of keywords
Game Theory and Economic Paradigms -> GTEP: Computational social choice
2224
NELLIE: A Neuro-Symbolic Inference Engine for Grounded, Compositional, and Explainable Reasoning
Nathaniel Weir, Peter Clark, Benjamin Van Durme
6 min. talk | August 7th at 15:00 | Session: KRR: Knowledge Representation and Reasoning (1/2)
Our goal is to develop a modern approach to answering questions via systematic reasoning where answers are supported by human interpretable proof trees grounded in an NL corpus of facts. Such a system would help alleviate the challenges of interpretability and hallucination with modern LMs, and the lack of grounding of current explanation methods (e.g., Chain-of-Thought). This paper proposes a new take on Prolog-based inference engines, where we replace handcrafted rules with a combination of neural language modeling, guided generation, and semiparametric dense retrieval. Our implementation, NELLIE, is the first system to demonstrate fully interpretable, end-to-end grounded QA as entailment tree proof search, going beyond earlier work explaining known-to-be-true facts from text. In experiments, NELLIE outperforms a similar-sized state-of-the-art reasoner while producing knowledge-grounded explanations. We also find NELLIE can exploit both semi-structured and NL text corpora to guide reasoning. Together these suggest a new way to jointly reap the benefits of both modern neural methods and traditional symbolic reasoning.
List of keywords
Knowledge Representation and Reasoning -> KRR: Automated reasoning and theorem proving
Knowledge Representation and Reasoning -> KRR: Reasoning about knowledge and belief
Natural Language Processing -> NLP: Question answering
Search -> S: Other
2226
Hypergraph Self-supervised Learning with Sampling-efficient Signals
Fan Li, Xiaoyang Wang, Dawei Cheng, Wenjie Zhang, Ying Zhang, Xuemin Lin
12 min. talk | August 7th at 11:30 | Session: ML: Self-supervised Learning
Self-supervised learning (SSL) provides a promising alternative for representation learning on hypergraphs without costly labels. However, existing hypergraph SSL models are mostly based on contrastive methods with the instance-level discrimination strategy, suffering from two significant limitations: (1) They select negative samples arbitrarily, which is unreliable in deciding similar and dissimilar pairs, causing training bias. (2) They often require a large number of negative samples, resulting in expensive computational costs. To address the above issues, we propose SE-HSSL, a hypergraph SSL framework with three sampling-efficient self-supervised signals. Specifically, we introduce two sampling-free objectives leveraging the canonical correlation analysis as the node-level and group-level self-supervised signals. Additionally, we develop a novel hierarchical membership-level contrast objective motivated by the cascading overlap relationship in hypergraphs, which can further reduce membership sampling bias and improve the efficiency of sample utilization. Through comprehensive experiments on 7 real-world hypergraphs, we demonstrate the superiority of our approach over the state-of-the-art method in terms of both effectiveness and efficiency.
List of keywords
Machine Learning -> ML: Self-supervised Learning
Data Mining -> DM: Mining graphs
2228
Predictive Modeling with Temporal Graphical Representation on Electronic Health Records
Jiayuan Chen, Changchang Yin, Yuanlong Wang, Ping Zhang
6 min. talk | August 9th at 11:30 | Session: MTA: Health and medicine
Deep learning-based predictive models, leveraging Electronic Health Records (EHR), are receiving increasing attention in healthcare. An effective representation of a patient’s EHR should hierarchically encompass both the temporal relationships between historical visits and medical events, and the inherent structural information within these elements. Existing patient representation methods can be roughly categorized into sequential representation and graphical representation. The sequential representation methods focus only on the temporal relationships among longitudinal visits. On the other hand, the graphical representation approaches, while adept at extracting the graph-structured relationships between various medical events, fall short in effectively integrating temporal information. To capture both types of information, we model a patient’s EHR as a novel temporal heterogeneous graph. This graph includes historical visit nodes and medical event nodes. It propagates structured information from medical event nodes to visit nodes and utilizes time-aware visit nodes to capture changes in the patient’s health status. Furthermore, we introduce a novel temporal graph transformer (TRANS) that integrates temporal edge features, global positional encoding, and local structural encoding into heterogeneous graph convolution, capturing both temporal and structural information. We validate the effectiveness of TRANS through extensive experiments on three real-world datasets. The results show that our proposed approach achieves state-of-the-art performance.
List of keywords
Multidisciplinary Topics and Applications -> MTA: Health and medicine
Data Mining -> DM: Applications
2243
Towards Sharper Generalization Bounds for Adversarial Contrastive Learning
Wen Wen, Han Li, Tieliang Gong, Hong Chen
6 min. talk | August 7th at 15:00 | Session: ML: Machine Learning (4/6)
Recently, enhancing the adversarial robustness of machine learning algorithms has gained significant attention across various application domains. Given the widespread label scarcity issue in real-world data, adversarial contrastive learning (ACL) has been proposed to adversarially train robust models using unlabeled data. Despite the empirical success, its generalization behavior remains poorly understood and far from being well-characterized. This paper aims to address this issue from a learning theory perspective. We establish novel high-probability generalization bounds for general Lipschitz loss functions. The derived bounds scale as O(log k) in the number of negative samples k, improving on the existing bounds with linear dependency. Our results are generally applicable to many prediction models, including linear models and deep neural networks. In particular, we obtain an optimistic generalization bound of O(1/n) in the sample size n under a smoothness assumption on the loss function. To the best of our knowledge, this is the first fast-rate bound valid for ACL. Empirical evaluations on real-world datasets verify our theoretical findings.
List of keywords
Machine Learning -> ML: Adversarial machine learning
Machine Learning -> ML: Learning theory
Machine Learning -> ML: Self-supervised Learning
2249
FastScene: Text-Driven Fast Indoor 3D Scene Generation via Panoramic Gaussian Splatting
Yikun Ma, Dandan Zhan, Zhi Jin
6 min. talk | August 6th at 15:00 | Session: CV: 3D computer vision (2/2)
Text-driven 3D indoor scene generation holds broad applications, ranging from gaming and smart homes to AR/VR applications. Fast and high-fidelity scene generation is paramount for ensuring user-friendly experiences. However, existing methods are characterized by lengthy generation processes or necessitate the intricate manual specification of motion parameters, which introduces inconvenience for users. Furthermore, these methods often rely on narrow-field viewpoint iterative generations, compromising global consistency and overall scene quality. To address these issues, we propose FastScene, a framework for fast and higher-quality 3D scene generation, while maintaining scene consistency. Specifically, given a text prompt, we generate a panorama and estimate its depth, since the panorama encompasses information about the entire scene and exhibits explicit geometric constraints. To obtain high-quality novel views, we introduce the Coarse View Synthesis (CVS) and Progressive Novel View Inpainting (PNVI) strategies, ensuring both scene consistency and view quality. Subsequently, we utilize Multi-View Projection (MVP) to form perspective views, and apply 3D Gaussian Splatting (3DGS) for scene reconstruction. Comprehensive experiments demonstrate FastScene surpasses other methods in both generation speed and quality with better scene consistency. Notably, guided only by a text prompt, FastScene can generate a 3D scene within a mere 15 minutes, which is at least one hour faster than state-of-the-art methods, making it a paradigm for user-friendly scene generation.
List of keywords
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Scene analysis and understanding   
2255
Motion-Aware Heatmap Regression for Human Pose Estimation in Videos
Inpyo Song, Jongmin Lee, Moonwook Ryu, Jangwon Lee
6 min. talk | August 7th at 11:30 | Session: CV: Biometrics, face, gesture and pose recognition
We present an approach to 2D human pose estimation in videos. Estimating human poses in videos differs from estimating them in static images, since videos contain rich motion-related information. Thus, we investigate how to utilize information about human body movements across a sequence of video frames for estimating human poses. To do this, we introduce a novel heatmap regression method, which we call motion-aware heatmap regression. Our approach computes motion vectors of joint keypoints from adjacent frames. We then design a new style of heatmap, which we call Motion-Aware Heatmaps, to reflect the motion uncertainty of each joint point. Unlike traditional heatmaps, our motion-aware heatmaps not only consider the current joint locations but also account for how joints move over time. Furthermore, we introduce a simple yet effective framework designed to incorporate motion information into heatmap regression. We evaluate our motion-aware heatmap regression on the PoseTrack (2018, 21) and Sub-JHMDB datasets. Our results validate that the proposed motion-aware heatmaps significantly improve the precision of human pose estimation in videos, particularly in challenging scenarios such as sports game footage with substantial human motion. (Code and related materials are available at https://github.com/Songinpyo/MTPose.)
List of keywords
Computer Vision -> CV: Biometrics, face, gesture and pose recognition
Computer Vision -> CV: Action and behavior recognition
Computer Vision -> CV: Video analysis and understanding   
2260
Continual Multi-View Clustering with Consistent Anchor Guidance
Chao Zhang, Deng Xu, Xiuyi Jia, Chunlin Chen, Huaxiong Li
6 min. talk | August 6th at 15:00 | Session: ML: Multi-view learning
Multi-view clustering (MVC) has recently attracted much attention. Most existing approaches are designed for fixed multi-view data and cannot deal with the streaming data common in the real world. In this paper, we address this problem by proposing a consistent Anchor guided Continual MVC (ACMVC) method that works in two stages. In the initial learning stage, a low-rank anchor graph based model is constructed. In the continual learning stage, to leverage historical knowledge, the multi-level anchor information is reused to refine the model via adding consistency regularization. It not only provides prior knowledge to enhance the exploration of current data, but also captures the similarity relationship between previous and current data, enabling comprehensive exploitation of streaming data. The proposed model can be optimized efficiently with linear time and space complexity. Experiments demonstrate the effectiveness and efficiency of our method compared with some state-of-the-art approaches.
List of keywords
Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering
Machine Learning -> ML: Unsupervised learning
Data Mining -> DM: Mining data streams
2262
Large Language Model as a Policy Teacher for Training Reinforcement Learning Agents
Zihao Zhou, Bin Hu, Chenyang Zhao, Pu Zhang, Bin Liu
6 min. talk | August 8th at 15:00 | Session: ML: Reinforcement learning (2/2)
Recent studies have uncovered the potential of Large Language Models (LLMs) in addressing complex sequential decision-making tasks through the provision of high-level instructions. However, LLM-based agents lack specialization in tackling specific target problems, particularly in real-time dynamic environments. Additionally, deploying an LLM-based agent in practical scenarios can be both costly and time-consuming. On the other hand, reinforcement learning (RL) approaches train agents that specialize in the target task but often suffer from low sampling efficiency and high exploration costs. In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task. We conducted experiments on challenging MiniGrid and Habitat environments, specifically designed for embodied AI research, to evaluate the effectiveness of our framework. The results clearly demonstrate that our approach achieves superior performance compared to strong baseline methods. Our code is available at https://github.com/ZJLAB-AMMI/LLM4Teach.
List of keywords
Machine Learning -> ML: Reinforcement learning
Natural Language Processing -> NLP: Language models
Uncertainty in AI -> UAI: Sequential decision making
2264
Learning Hierarchy-Enhanced POI Category Representations Using Disentangled Mobility Sequences
Hongwei Jia, Meng Chen, Weiming Huang, Kai Zhao, Yongshun Gong
6 min. talk | August 7th at 10:00 | Session: DM: Mining spatial and/or temporal data (1/2)
Points of interest (POIs) carry a wealth of semantic information of varying locations in cities and thus have been widely used to enable various location-based services. To understand POI semantics, existing methods usually model contextual correlations of POI categories in users’ check-in sequences and embed categories into a latent space based on the word2vec framework. However, such an approach does not fully capture the underlying hierarchical relationship between POI categories and can hardly integrate the category hierarchy into various deep sequential models. To overcome this shortcoming, we propose a Semantically Disentangled POI Category Embedding Model (SD-CEM) to generate hierarchy-enhanced category representations using disentangled mobility sequences. Specifically, first, we construct disentangled mobility sequences using human mobility data based on the semantics of POIs. Then we utilize the POI category hierarchy to initialize a hierarchy-enhanced representation for each category in the disentangled sequences, employing an attention mechanism. Finally, we optimize these category representations by incorporating both the masked category prediction task and the next category prediction task. To evaluate the effectiveness of SD-CEM, we conduct comprehensive experiments using two check-in datasets covering three tasks. Experimental results demonstrate that SD-CEM outperforms several competitive baselines, highlighting its substantial improvement in performance as well as the understanding of learned category representations.
List of keywords
Data Mining -> DM: Mining spatial and/or temporal data
2280
Protecting Split Learning by Potential Energy Loss
Fei Zheng, Chaochao Chen, Lingjuan Lyu, Xinyi Fu, Xing Fu, Weiqiang Wang, Xiaolin Zheng, Jianwei Yin
6 min. talk | August 8th at 10:00 | Session: ML: Federated learning (2/2)
As a practical privacy-preserving learning method, split learning has drawn much attention in academia and industry. However, its security is constantly being questioned since the intermediate results are shared during training and inference. In this paper, we focus on the privacy leakage from the forward embeddings of split learning. Specifically, since the forward embeddings contain too much information about the label, the attacker can either use a few labeled samples to fine-tune the top model or perform unsupervised attacks such as clustering to infer the true labels from the forward embeddings. To prevent this kind of privacy leakage, we propose the potential energy loss to make the forward embeddings more ‘complicated’, by pushing embeddings of the same class towards the decision boundary. Therefore, it is hard for the attacker to learn from the forward embeddings. Experiment results show that our method significantly lowers the performance of both fine-tuning attacks and clustering attacks.
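A sketch of a potential-energy-style penalty, under the assumption that it takes a Coulomb-like pairwise repulsion between same-class embeddings; the paper's exact functional form may differ.

import torch

def potential_energy_loss(emb, labels):
    # Same-class embeddings repel like charges, spreading them toward the
    # decision boundary so forward embeddings reveal less about labels.
    loss = emb.new_zeros(())
    for c in labels.unique():
        z = emb[labels == c]
        if z.size(0) < 2:
            continue
        d = torch.cdist(z, z) + 1e-6  # epsilon guards against zero distances
        loss = loss + (1.0 / d).triu(diagonal=1).sum() / z.size(0)
    return loss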
List of keywords
Machine Learning -> ML: Federated learning
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
Multidisciplinary Topics and Applications -> MTA: Security and privacy
2282
EPIC: Graph Augmentation with Edit Path Interpolation via Learnable Cost
Jaeseung Heo, Seungbeom Lee, Sungsoo Ahn, Dongwoo Kim
6 min. talk | August 8th at 15:00 | Session: ML: Sequence and graph learning
Data augmentation plays a critical role in improving model performance across various domains, but it becomes challenging with graph data due to their complex and irregular structure. To address this issue, we propose EPIC (Edit Path Interpolation via learnable Cost), a novel interpolation-based method for augmenting graph datasets. To interpolate between two graphs lying in an irregular domain, EPIC leverages the concept of graph edit distance, constructing an edit path that represents the transformation process between two graphs via edit operations. Moreover, our method introduces a context-sensitive cost model that accounts for the importance of specific edit operations formulated through a learning framework. This allows for a more nuanced transformation process, where the edit distance is not merely count-based but reflects meaningful graph attributes. With randomly sampled graphs from the edit path, we enrich the training set to enhance the generalization capability of classification models. Experimental evaluations across several benchmark datasets demonstrate that our approach outperforms existing augmentation techniques in many tasks.
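The interpolation step can be sketched with hypothetical graph and edit-operation helpers: apply a prefix of the edit path transforming one graph into another to obtain an intermediate training graph.

import random

def interpolate_on_edit_path(g_src, edit_path, alpha):
    # Apply the first alpha-fraction of edit operations (node/edge additions,
    # deletions, relabelings) along the path from g_src towards the target graph.
    g = g_src.copy()                      # hypothetical graph object
    for op in edit_path[:int(alpha * len(edit_path))]:
        op.apply(g)                       # hypothetical edit operation
    return g

# Usage: augmented = interpolate_on_edit_path(g1, path_to_g2, random.random())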
List of keywords
Machine Learning -> ML: Sequence and graph learning
2284
Estimating Conditional Average Treatment Effects via Sufficient Representation Learning
Pengfei Shi, Wei Zhong, Xinyu Zhang, Ningtao Wang, Xing Fu, Weiqiang Wang, Yin Jin
6 min. talk | August 7th at 15:00 | Session: ML: Machine Learning (5/6)
Estimating conditional average treatment effects (CATE) is very important in causal inference and has a wide range of applications across many fields. In the estimation of CATE, the unconfoundedness assumption is typically required to ensure the identifiability of the regression problems. When estimating CATE from high-dimensional data, many variable selection methods and neural network approaches based on representation learning have been proposed; however, these methods do not provide a way to verify whether the subset of variables after dimensionality reduction, or the learned representations, still satisfy the unconfoundedness assumption during the estimation process, which can lead to ineffective estimates of the treatment effects. Additionally, these methods typically use data from only the treatment or the control group when estimating the regression function for each group. This paper proposes a novel neural network approach named CrossNet that learns a sufficient representation of the features, based on which we then estimate the CATE; ‘cross’ indicates that, in estimating the regression functions, we use data from each group itself and also cross-utilize data from the other group. Numerical simulations and empirical results demonstrate that our method outperforms the competitive approaches.
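For orientation, under unconfoundedness the estimand is the difference of two group-wise regressions, tau(x) = mu1(x) - mu0(x). The sketch below is a plain two-model baseline (a T-learner), not CrossNet itself, which additionally learns a sufficient representation and cross-uses both groups' data when fitting each regression.

from sklearn.ensemble import GradientBoostingRegressor

def t_learner_cate(X, y, t, X_new):
    # Fit one outcome regression per treatment group, then take the difference.
    mu1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
    mu0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])
    return mu1.predict(X_new) - mu0.predict(X_new)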
List of keywords
Machine Learning -> ML: Regression
Machine Learning -> ML: Causality
Machine Learning -> ML: Representation learning
Machine Learning -> ML: Supervised Learning
2309
STAR: Spatio-Temporal State Compression for Multi-Agent Tasks with Rich Observations
Chao Li, Yujing Hu, Shangdong Yang, Tangjie Lv, Changjie Fan, Wenbin Li, Chongjie Zhang, Yang Gao
6 min. talk | August 7th at 15:00 | Session: MAS: Multi-agent learning
This paper focuses on the problem of learning compressed state representations for multi-agent tasks. Under the rich-observation assumption, we pinpoint that the state representations should be compressed both spatially and temporally to enable efficient prioritization of task-relevant features, which existing works typically fail to achieve. To overcome this limitation, we propose a novel method named Spatio-Temporal stAte compRession (STAR) that explicitly defines both spatial and temporal compression operations on the learned state representations to encode per-agent task-relevant features. Specifically, we first formalize this problem by introducing the Task Informed Partially Observable Stochastic Game (TI-POSG). Then, we identify the spatial representation compression in it as encoding the latent states from the joint observations of all agents, and achieve this by learning representations that approximate the latent states based on an information-theoretic principle. After that, we further extract the task-relevant features of each agent from these representations by aligning them based on their reward similarities, which is regarded as the temporal representation compression. Structurally, we implement these two compressions by learning a set of agent-specific decoding functions and incorporating them into a critic shared by agents for scalable learning. We evaluate our method by developing decentralized policies on 12 maps of the StarCraft Multi-Agent Challenge benchmark, and the superior performance demonstrates its effectiveness.
List of keywords
Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
Machine Learning -> ML: Reinforcement learning
2312
Modeling Personalized Retweeting Behaviors for Multi-Stage Cascade Popularity Prediction
Mingyang Zhou, Yanjie Lin, Gang Liu, Zuwen Li, Hao Liao, Rui Mao
6 min. talk | August 7th at 15:00 | Session: DM: Data Mining (1/2)
Predicting the size of message cascades is critical in various applications, such as online advertising and early detection of rumors. However, most existing deep learning approaches rely on cascade observation, which hinders accurate cascade prediction before a message is posted. Moreover, these approaches overlook personalized retweeting behaviors that reflect users’ inclination to retweet specific types of information. In this study, we propose a universal cascade prediction framework, namely Cascade prediction regarding Multiple Stage (CasMS), that effectively predicts cascade popularity across the message generation stage as well as the short-term and long-term stages. Unlike previous methods, our approach not only captures users’ personalized retweeting behaviors but also incorporates temporal cascade features. We perform experiments on datasets we collected ourselves as well as on public datasets. The results show that our method significantly surpasses existing approaches in predicting the cascade during the message generation stage and across different time periods of the cascade dynamics.
List of keywords
Data Mining -> DM: Mining text, web, social media
Data Mining -> DM: Recommender systems
Machine Learning -> ML: Applications
2319
Multi-Granularity Graph-Convolution-Based Method for Weakly Supervised Person Search
Haichun Tai, De Cheng, Jie Li, Nannan Wang, Xinbo Gao
12 min. talk | August 9th at 10:00 | Session: CV: Representation learning
One-step Weakly Supervised Person Search (WSPS) jointly performs pedestrian detection and person Re-IDentification (ReID) with only bounding box annotations, which makes the traditional person ReID problem more suitable and efficient for real-world applications. However, this task is very challenging due to the following reasons: 1) a large feature gap between person ReID and general object detection tasks when learning shared representations; 2) difficult pseudo-identity estimation for each person image under unrefined raw detections and dramatic scale changes. To address the above issues, we propose a multi-granularity graph convolution framework to jointly optimize the aligned task features, as well as to assist the pseudo-label estimation. Specifically, the multi-granularity feature alignment module (MFA) in the designed two-branch framework employs cluster-level bi-directional interaction of various granularity information to narrow down the large feature gap. Further, upon the MFA module, we introduce the multi-granularity graph-convolution-based pseudo-label estimation module to enhance feature representations for distinguishing diverse identities. Extensive experimental results demonstrate the effectiveness of the proposed method and show superior performance over state-of-the-art methods by a large margin on the CUHK-SYSU and PRW datasets.
List of keywords
Computer Vision -> CV: Representation learning
Computer Vision -> CV: Image and video retrieval 
Computer Vision -> CV: Recognition (object detection, categorization)
2320
P2P: Transforming from Point Supervision to Explicit Visual Prompt for Object Detection and Segmentation
Guangqian Guo, Dian Shao, Chenguang Zhu, Sha Meng, Xuan Wang, Shan Gao
6 min. talk | August 8th at 10:00 | Session: ML: Weakly supervised learning
Point-supervised vision tasks, including detection and segmentation, which aim to learn a network that transforms points into pseudo labels, have attracted much attention in recent years. However, the lack of precise object size and boundary annotations in the point-supervised condition results in a large performance gap between point- and fully-supervised methods. In this paper, we propose a novel iterative learning framework, Point to Prompt (P2P), for point-supervised object detection and segmentation, with the key insight of transforming point supervision into explicit visual prompts for the foundation model. P2P is formulated as an iterative refinement process of two stages: Semantic Explicit Prompt Generation (SEPG) and Prompt Guided Spatial Refinement (PGSR). Specifically, SEPG serves as a prompt generator for generating semantic-explicit prompts from point input via a group-based learning strategy. In the PGSR stage, prompts guide the visual foundation model to further refine the object regions by leveraging the outstanding generalization ability of the foundation model. The two stages are iterated multiple times to progressively improve the quality of predictions. Experimental results on multiple datasets demonstrate that P2P achieves SOTA performance in both detection and segmentation tasks, further narrowing the performance gap with fully-supervised methods. The source code and supplementary material can be found at https://github.com/guangqian-guo/P2P.
List of keywords
Machine Learning -> ML: Weakly supervised learning
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Segmentation
2322
Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors
Shiyin Dong, Mingrui Zhu, Kun Cheng, Nannan Wang, Xinbo Gao
12 min. talk | August 9th at 10:00 | Session: CV: Representation learning
The remarkable prowess of diffusion models in image generation has spurred efforts to extend their application beyond generative tasks. However, a persistent challenge remains: the lack of a unified approach for applying diffusion models to visual perception tasks with diverse semantic granularity requirements. Our purpose is to establish a unified visual perception framework, capitalizing on the potential synergies between generative and discriminative models. In this paper, we propose Vermouth, a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an Adapted-Expert providing discriminative priors. Comprehensive investigations unveil potential characteristics of Vermouth, such as the varying granularity of perception concealed in latent variables at distinct time steps and at various U-Net stages. We emphasize that there is no need to incorporate a heavyweight or intricate decoder to transform diffusion models into potent representation learners. Extensive comparative evaluations against tailored discriminative models showcase the efficacy of our approach on zero-shot sketch-based image retrieval (ZS-SBIR), few-shot classification, and open-vocabulary (OV) semantic segmentation tasks. The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
List of keywords
Computer Vision -> CV: Representation learning
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Segmentation
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
2331
A Dataset and Model for Realistic License Plate Deblurring
Haoyan Gong, Yuzheng Feng, Zhenrong Zhang, Xianxu Hou, Jingxin Liu, Siqi Huang, Hongbin Liu
6 min. talk | August 7th at 11:30 | Session: CV: Adversarial learning, adversarial attack and defense methods
Vehicle license plate recognition is a crucial task in intelligent traffic management systems. However, the challenge of achieving accurate recognition persists due to motion blur from fast-moving vehicles. Despite the widespread use of image synthesis approaches in existing deblurring and recognition algorithms, their effectiveness in real-world scenarios remains unproven. To address this, we introduce the first large-scale license plate deblurring dataset, named License Plate Blur (LPBlur), captured by a dual-camera system and processed through a post-processing pipeline to avoid misalignment issues. Then, we propose a License Plate Deblurring Generative Adversarial Network (LPDGAN) to tackle license plate deblurring, with: 1) a Feature Fusion Module to integrate multi-scale latent codes; 2) a Text Reconstruction Module to restore structure through the textual modality; 3) a Partition Discriminator Module to enhance the model’s perception of details in each letter. Extensive experiments validate the reliability of the LPBlur dataset for both model training and testing, showcasing that our proposed model outperforms other state-of-the-art motion deblurring methods in realistic license plate deblurring scenarios. The dataset and code are available at https://github.com/haoyGONG/LPDGAN.
List of keywords
Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
Computer Vision -> CV: Applications
Computer Vision -> CV: Image and video synthesis and generation 
2348
LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation
Wentao Jiang, Jing Zhang, Di Wang, Qiming Zhang, Zengmao Wang, Bo Du
6 min. talk | August 9th at 10:00 | Session: CV: Representation learning
Due to spatial redundancy in remote sensing images, sparse tokens containing rich information are usually involved in self-attention (SA) to reduce the overall number of tokens in the computation, avoiding the high computational cost of Vision Transformers. However, such methods usually obtain sparse tokens via hand-crafted or parallel-unfriendly designs, posing a challenge to reaching a better balance between efficiency and performance. In contrast, this paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information while improving inference speed. Technically, the meta tokens are first initialized from image tokens via cross-attention. Then, we propose Dual Cross-Attention (DCA) to promote information exchange between image tokens and meta tokens, where they alternately serve as query and key (value) tokens in a dual-branch structure, significantly reducing the computational complexity compared to self-attention. By employing DCA in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT in various sizes. Experimental results on classification and dense prediction tasks show that LeMeViT has a significant 1.7× speedup, fewer parameters, and competitive performance compared to the baseline models, achieving a better trade-off between efficiency and performance. The code is released at https://github.com/ViTAE-Transformer/LeMeViT.
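To make the complexity argument concrete, here is a hedged PyTorch sketch of a dual cross-attention block in which a small set of learnable meta tokens and the image tokens alternately serve as query versus key/value; the module layout, dimensions, and names are illustrative assumptions, not LeMeViT's exact architecture:

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    """Sketch of DCA: with M meta tokens and N image tokens, each call
    costs O(N*M) instead of the O(N^2) of full self-attention."""

    def __init__(self, dim: int = 64, heads: int = 4, num_meta: int = 8):
        super().__init__()
        self.meta = nn.Parameter(torch.randn(1, num_meta, dim))
        self.meta_from_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_from_meta = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m = self.meta.expand(x.shape[0], -1, -1)
        m, _ = self.meta_from_img(m, x, x)  # meta queries image: summarize
        x, _ = self.img_from_meta(x, m, m)  # image queries meta: broadcast
        return x

tokens = torch.randn(2, 196, 64)            # batch of N = 196 image tokens
print(DualCrossAttention()(tokens).shape)   # torch.Size([2, 196, 64])
```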
List of keywords
Computer Vision -> CV: Representation learning
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Segmentation
2354
Distribution-Independent Cell Type Identification for Single-Cell RNA-seq Data
Yuyao Zhai, Liang Chen, Minghua Deng
6 min. talk | August 8th at 15:00 | Session: MTA: Bioinformatics
Automatic cell type annotation aims to transfer label knowledge from label-abundant reference data to label-scarce target data, and has made encouraging progress in single-cell RNA-seq data analysis. While previous works have focused on classifying close-set cells and detecting open-set cells during testing, it remains essential to be able to classify unknown cell types the way human experts can. Additionally, few efforts have been devoted to addressing the common long-tail dilemma in cell type annotation data. Therefore, in this paper, we propose an innovative distribution-independent universal cell type identification framework called scDET from the perspective of autonomously equilibrated dual-consultative contrastive learning. Our model can generate fine-grained predictions for both close-set and open-set cell types in a long-tailed open-world environment. scDET consists of a contrastive-learning branch and a pseudo-labeling branch, which work collaboratively to provide interactive supervision. Specifically, the contrastive-learning branch provides reliable distribution estimation to regularize the predictions of the pseudo-labeling branch, which in turn guides itself through self-balanced knowledge transfer and a designed novel soft contrastive loss. Extensive experimental results on various evaluation datasets demonstrate the superior performance of scDET over other state-of-the-art single-cell clustering and annotation methods.
List of keywords
Multidisciplinary Topics and Applications -> MTA: Bioinformatics
Multidisciplinary Topics and Applications -> MTA: Other
2360
Efficient Event Stream Super-Resolution with Recursive Multi-Branch Fusion
Quanmin Liang, Zhilin Huang, Xiawu Zheng, Feidiao Yang, Jun Peng, Kai Huang, Yonghong Tian
6 min. talk | August 8th at 11:30 | Session: CV: Image and video synthesis and generation (1/2)
Current Event Stream Super-Resolution (ESR) methods overlook the redundant and complementary information present in positive and negative events within the event stream, employing a direct mixing approach for super-resolution, which may lead to detail loss and inefficiency. To address these issues, we propose an efficient Recursive Multi-Branch Information Fusion Network (RMFNet) that separates positive and negative events for complementary information extraction, followed by mutual supplementation and refinement. In particular, we introduce Feature Fusion Modules (FFM) and Feature Exchange Modules (FEM). FFM is designed for the fusion of contextual information within neighboring event streams, leveraging the coupling relationship between positive and negative events to mitigate misleading noise in the respective branches. FEM efficiently promotes the fusion and exchange of information between the positive and negative branches, enabling superior local information enhancement and global information complementation. Experimental results demonstrate that our approach achieves over 17% and 31% improvement on synthetic and real datasets respectively, accompanied by a 2.3x acceleration. Furthermore, we evaluate our method on two downstream event-driven applications, i.e., object recognition and video reconstruction, achieving remarkable results that outperform existing methods. Our code and supplementary material are available at https://github.com/Lqm26/RMFNet.
List of keywords
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Other
2365
Parameterized Complexity of Kidney Exchange Revisited
Úrsula Hébert-Johnson, Daniel Lokshtanov, Chinmay Sonar, Vaishali Surianarayanan
6 min. talk | August 7th at 11:30 | Session: MAS: Agent-based and Multi-agent Systems (1/2)
As of January 2023, there are more than 90,000 people on the national transplant waiting list in need of a kidney in the United States. These patients often have a friend or family member who is willing to donate, but whose kidney type might not be compatible. To help match these patients to suitable donors, patient-donor compatibility can be modeled as a directed graph. Specifically, in the Kidney Exchange problem, the input is a directed graph G, a subset B of vertices (altruistic donors), and two integers l_p and l_c. An altruistic donor is a donor who is not paired with a patient, and the remaining vertices are patient-donor pairs. Whenever a donor is compatible with a patient from a patient-donor pair, we place a directed edge from the donor vertex to the patient-donor pair. Here the donor vertex can be either altruistic or non-altruistic. The goal is to find a collection of vertex-disjoint cycles and paths covering the maximum number of patients such that each cycle has length at most l_c, and such that each path has length at most l_p and begins at a vertex in B. The path and cycle lengths are bounded so that the surgeries for a given path or cycle can be performed simultaneously. Kidney Exchange has received a great deal of attention in recent years. We contribute to this line of work by closing two open problems from IJCAI ’18 and IJCAI ’22: “Is Kidney Exchange FPT when parameterized by (i) the treewidth (omega) of G and (ii) the number of vertex types in G?” Two vertices have the same vertex type if they have the same in- and out-neighborhoods. We show that Kidney Exchange is FPT parameterized by the number of vertex types. On the other hand, we show W[1]-hardness with respect to omega. We also design a randomized 4^t * n^O(1)-time algorithm parameterized by t, the number of patients helped, significantly improving upon the previous state of the art, which was 161^t * n^O(1).
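To ground the problem definition, here is a toy enumerate-and-pack heuristic (not one of the paper's exact FPT or randomized algorithms): it lists cycles of length at most l_c among patient-donor pairs and altruist-initiated paths of length at most l_p, then greedily keeps vertex-disjoint ones:

```python
import networkx as nx

def greedy_kidney_exchange(G: nx.DiGraph, B: set, l_p: int, l_c: int):
    """Toy heuristic for the Kidney Exchange problem statement: enumerate
    feasible cycles and altruist-started paths, then greedily pack
    vertex-disjoint ones. Exact algorithms (as in the paper) would
    optimize the number of covered patients instead."""
    candidates = []
    pairs = set(G) - B
    for cyc in nx.simple_cycles(G.subgraph(pairs)):   # cycles of pairs only
        if len(cyc) <= l_c:
            candidates.append(set(cyc))               # all vertices: patients
    for b in B:                                       # paths must start in B
        for v in pairs:
            for path in nx.all_simple_paths(G, b, v, cutoff=l_p):
                candidates.append(set(path))          # covers len(path) - 1
    covered, solution = set(), []
    for cand in sorted(candidates, key=len, reverse=True):
        if cand.isdisjoint(covered):
            covered |= cand
            solution.append(cand)
    return solution

G = nx.DiGraph([("a", "p1"), ("p1", "p2"), ("p2", "p3"), ("p3", "p1")])
print(greedy_kidney_exchange(G, B={"a"}, l_p=2, l_c=3))  # the 3-cycle of pairs
```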
List of keywords
Agent-based and Multi-agent Systems -> MAS: Resource allocation
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Game Theory and Economic Paradigms -> GTEP: Auctions and market-based systems
Game Theory and Economic Paradigms -> GTEP: Computational social choice
2371
FreqFormer: Frequency-aware Transformer for Lightweight Image Super-resolution
Tao Dai, Jianping Wang, Hang Guo, Jinmin Li, Jinbao Wang, Zexuan Zhu
6 min. talk | August 9th at 10:00 | Session: CV: Applications
Transformer-based models have been widely and successfully used in various low-level vision tasks and have achieved remarkable performance in single image super-resolution (SR). Despite this significant progress, Transformer-based SR methods (e.g., SwinIR) still suffer from heavy computational cost and a low-frequency preference, while ignoring the reconstruction of rich high-frequency information, hence hindering the representational power of Transformers. To address these issues, in this paper, we propose a novel Frequency-aware Transformer (FreqFormer) for lightweight image SR. Specifically, a Frequency Division Module (FDM) is first introduced to separately handle high- and low-frequency information in a divide-and-conquer manner. Moreover, we present a Frequency-aware Transformer Block (FTB) that extracts both spatial frequency attention and channel-transposed attention to recover high-frequency details. Extensive experimental results on public datasets demonstrate the superiority of our FreqFormer over state-of-the-art SR methods in terms of both quantitative metrics and visual quality. Code and models are available at https://github.com/JPWang-CS/FreqFormer.
List of keywords
Computer Vision -> CV: Applications
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Interpretability and transparency
Computer Vision -> CV: Machine learning for vision
2383
Machine Unlearning via Null Space Calibration
Huiqiang Chen, Tianqing Zhu, Xin Yu, Wanlei Zhou
6 min. talk | August 9th at 10:00 | Session: ETF: Trustworthy AI
Machine unlearning aims to enable models to forget specific data instances when receiving deletion requests. Current research centers on efficient unlearning to erase the influence of data from the model and neglects the subsequent impact on the remaining data. Consequently, existing unlearning algorithms degrade the model’s performance after unlearning, a phenomenon known as over-unlearning. This paper addresses this critical yet under-explored issue by introducing machine Unlearning via Null Space Calibration (UNSC), which can accurately unlearn target samples without over-unlearning. Moreover, by calibrating the decision space during unlearning, UNSC can significantly improve the model’s performance on the remaining samples. In particular, our approach hinges on confining the unlearning process to a specified null space tailored to the remaining samples, which is augmented by strategically pseudo-labeling the unlearning samples. Comparison against several established baselines affirms the superiority of our approach.
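One common way to realize a null-space constraint, sketched below under the assumption that UNSC's calibration resembles gradient-projection methods from continual learning: compute (via SVD) a projector onto the null space of the remaining data's feature matrix and project unlearning updates through it, so outputs on retained samples are barely disturbed; the pseudo-labeling step is not reproduced:

```python
import torch

def null_space_projector(feats: torch.Tensor, tol: float = 1e-5) -> torch.Tensor:
    """Projector onto the null space of the remaining data's features:
    right-singular vectors with (near-)zero singular values span it."""
    _, s, vh = torch.linalg.svd(feats, full_matrices=True)
    rank = int((s > tol * s.max()).sum())
    v_null = vh[rank:].T                     # basis of the null space
    return v_null @ v_null.T                 # (d, d) projection matrix

feats = torch.randn(100, 32)                 # activations of remaining data
feats[:, -8:] = 0                            # ensure a non-trivial null space
P = null_space_projector(feats)
grad = torch.randn(10, 32)                   # raw unlearning gradient for W
update = grad @ P                            # calibrated update direction
print((feats @ update.T).abs().max())        # ~0: retained outputs unchanged
```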
List of keywords
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
AI Ethics, Trust, Fairness -> ETF: Accountability
AI Ethics, Trust, Fairness -> ETF: Ethical, legal and societal issues
AI Ethics, Trust, Fairness -> ETF: Other
2388
Encoding Auxiliary Information to Restore Compressed Point Cloud Geometry
Gexin Liu, Jiahao Zhu, Dandan Ding, Zhan Ma
6 min. talk | August 8th at 10:00 | Session: DM: Mining spatial and/or temporal data (2/2)
The standardized Geometry-based Point Cloud Compression (G-PCC) suffers from limited coding performance and low-quality reconstruction. To address this, we propose AuxGR, a performance-complexity tradeoff solution for point cloud geometry restoration: leveraging an auxiliary bitstream to enhance the quality of G-PCC compressed point cloud geometry. This auxiliary bitstream efficiently encapsulates spatio-temporal information. For static coding, we perform paired information embedding (PIE) on the G-PCC decoded frame by employing target convolutions from its original counterpart, producing an auxiliary bitstream containing abundant original information. For dynamic coding, in addition to PIE, we propose temporal information embedding (TIE) to capture motion information between the previously restored and the current G-PCC decoded frames. TIE applies target kNN attention between them, which ensures temporal neighborhood construction for each point and implicitly represents motions. Due to the similarity across temporal frames, only the residuals between the TIE and PIE outputs are compressed as the auxiliary bitstream. Experimental results demonstrate that AuxGR notably outperforms existing methods in both static and dynamic coding scenarios. Moreover, our framework enables the flexible incorporation of auxiliary information under computation constraints, which is attractive for real-world applications.
List of keywords
Data Mining -> DM: Mining spatial and/or temporal data
2397
OTOcc: Optimal Transport for Occupancy Prediction
Pengteng Li, Ying He, F. Richard Yu, Pinhao Song, Xingchen Zhou, Guang Zhou
6 min. talk | August 6th at 11:30 | Session: CV: 3D computer vision (1/2)
The autonomous driving community is highly interested in 3D occupancy prediction due to its outstanding geometric perception and object recognition capabilities. However, previous methods are limited to existing semantic conversion mechanisms for solving the sparse ground-truth problem, causing excessive computational demands and sub-optimal voxel representations. To tackle the above limitations, we propose OTOcc, a novel 3D occupancy prediction framework that models semantic conversion from 2D pixels to 3D voxels as an Optimal Transport (OT) problem, offering accurate semantic mapping to adapt to sparse scenarios without attention or depth estimation. Specifically, the unit transportation cost between each demander (voxel) and supplier (pixel) pair is defined as the weighted occupancy prediction loss. Then, we utilize the Sinkhorn-Knopp iteration to find the best mapping matrices with minimal transportation costs. To reduce the computational cost, we propose a block reading technique with multi-perspective feature representation, which also brings fine-grained scene understanding. Extensive experiments show that OTOcc not only achieves competitive prediction performance but also reduces computational overhead by more than 4.58% compared to state-of-the-art methods.
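The Sinkhorn-Knopp iteration itself is standard; below is a minimal numpy version for reference, where the random cost matrix stands in for the paper's weighted occupancy prediction loss between each supplier (pixel) and demander (voxel):

```python
import numpy as np

def sinkhorn_knopp(cost, a, b, reg=0.1, n_iters=200):
    """Entropic optimal transport via Sinkhorn-Knopp: alternately rescale
    the Gibbs kernel so the plan's marginals match supplies a and demands b.
    Returns the (n, m) transport plan, i.e., a soft pixel-to-voxel mapping."""
    K = np.exp(-cost / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # satisfy demand marginals
        u = a / (K @ v)                  # satisfy supply marginals
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
cost = rng.random((6, 4))                # stand-in for prediction losses
plan = sinkhorn_knopp(cost, np.full(6, 1 / 6), np.full(4, 1 / 4))
print(plan.sum(axis=0))                  # ~[0.25 0.25 0.25 0.25]
```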
List of keywords
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Applications
Computer Vision -> CV: Machine learning for vision
2403
SCAT: A Time Series Forecasting with Spectral Central Alternating Transformers
Chengjie Zhou, Chao Che, Pengfei Wang, Qiang Zhang
6 min. talk | August 8th at 15:00 | Session: ML: Time series and data streams
Time series forecasting has essential applications across various domains. For instance, forecasting power time series can optimize energy usage and bolster grid stability and reliability. Existing Transformer-based models are limited to classical designs, ignoring the impact of spatial information and noise on architecture design. We therefore propose an atypical design of Transformer-based models for multivariate time series forecasting. This design consists of two critical components: (i) spectral clustering centers of the time series are employed as the focal points for attention computation; (ii) an alternating attention mechanism wherein each query is matched against the spectral clustering centers, executing attention at the sequence level instead of the token level. The alternating design has a two-fold benefit: first, it eliminates the uncertainty noise present in the dependent variable sequence of the channel input; second, it incorporates the Euclidean distance to mitigate the impact of extreme values on the attention matrix, thereby aligning predictions more closely with the sequence’s natural progression. Experiments on ten real-world datasets, encompassing Wind, Electricity, Weather, and others, demonstrate that our Spectral Central Alternating Transformer (SCAT) outperforms state-of-the-art (SOTA) methods by an average of 42.8% in prediction performance on power time series forecasting.
List of keywords
Machine Learning -> ML: Time series and data streams
2414
Laying the Foundations for Solving FOND HTN Problems: Grounding, Search, Heuristics (and Benchmark Problems)
Mohammad Yousefi, Pascal Bercher
6 min. talk | August 7th at 15:00 | Session: PS: Planning and Scheduling (1/2)
Building upon recent advancements in formalising Fully Observable Non-Deterministic (FOND) Hierarchical Task Network (HTN) planning, we present the first approach to find strong solutions for HTN problems with uncertainty in action outcomes. We present a search algorithm, along with a compilation that relaxes a FOND HTN problem to a deterministic one. This allows the utilisation of existing grounders and heuristics from the deterministic HTN planning literature.
List of keywords
Planning and Scheduling -> PS: Hierarchical planning
Planning and Scheduling -> PS: Planning algorithms
Planning and Scheduling -> PS: Planning under uncertainty
Planning and Scheduling -> PS: Search in planning and scheduling
2426
AnchorGT: Efficient and Flexible Attention Architecture for Scalable Graph Transformers
Wenhao Zhu, Guojie Song, Liang Wang, Shaoguo Liu
6 min. talk | August 8th at 15:00 | Session: ML: Sequence and graph learning
Graph Transformers (GTs) have significantly advanced the field of graph representation learning by overcoming the limitations of message-passing graph neural networks (GNNs) and demonstrating promising performance and expressive power. However, the quadratic complexity of the self-attention mechanism in GTs has limited their scalability, and previous approaches to address this issue often suffer from expressiveness degradation or lack of versatility. To this end, we propose AnchorGT, a novel attention architecture for GTs with a global receptive field and almost linear complexity, which serves as a flexible building block to improve the scalability of a wide range of GT models. Inspired by anchor-based GNNs, we employ a structurally important k-dominating node set as anchors and design an attention mechanism that focuses on the relationship between individual nodes and anchors, while retaining the global receptive field for all nodes. With its intuitive design, AnchorGT can easily replace the attention module in various GT models with different network architectures and structural encodings, resulting in reduced computational overhead without sacrificing performance. In addition, we theoretically prove that AnchorGT attention can be strictly more expressive than the Weisfeiler-Lehman test, showing its superiority in representing graph structures. Our experiments on three state-of-the-art GT models demonstrate that their AnchorGT variants can achieve similar results while being faster and significantly more memory efficient.
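As one simple way to obtain such anchors, the sketch below greedily builds a k-dominating set, i.e., a set of nodes such that every vertex lies within k hops of some anchor; AnchorGT's exact selection procedure may differ:

```python
import networkx as nx

def greedy_k_dominating_set(G: nx.Graph, k: int = 2) -> list:
    """Greedily pick the node whose k-hop ball covers the most uncovered
    vertices until every vertex is within distance k of an anchor."""
    balls = {v: set(nx.single_source_shortest_path_length(G, v, cutoff=k))
             for v in G}
    uncovered, anchors = set(G), []
    while uncovered:
        best = max(G, key=lambda v: len(balls[v] & uncovered))
        anchors.append(best)
        uncovered -= balls[best]
    return anchors

G = nx.karate_club_graph()
anchors = greedy_k_dominating_set(G, k=2)
print(len(anchors), "anchors dominate", G.number_of_nodes(), "nodes:", anchors)
```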
List of keywords
Machine Learning -> ML: Sequence and graph learning
2434
Diffusion Mask-Driven Visual-language Tracking
Guangtong Zhang, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shuxiang Song
6 min. talk | August 8th at 10:00 | Session: CV: Motion and tracking
Most existing visual-language trackers rely heavily on the initial language description of a target object to extract multi-modal features. However, the initial language description is often inaccurate for a highly time-varying video sequence, greatly deteriorating tracking performance due to the low quality of the extracted multi-modal features. To address this challenge, we propose a Diffusion Mask-Driven Visual-language Tracker (DMTrack) based on a diffusion model. Confronting the issue of low-quality multi-modal features caused by inaccurate language descriptions, we leverage the diffusion model to capture high-quality semantic information from multi-modal features and transform it into target mask features. During the training phase, we further enhance the diffusion model’s perception of pixel-level features by computing the loss between the target mask features and the ground-truth masks. Additionally, we perform joint localization of the target using both target mask features and visual features, instead of relying solely on multi-modal features for localization. Through extensive experiments on four tracking benchmarks (i.e., LaSOT, TNL2K, LaSOText, and OTB-Lang), we validate that our proposed tracker improves the robustness and effectiveness of the model.
List of keywords
Computer Vision -> CV: Motion and tracking
2447
Unified Unsupervised Salient Object Detection via Knowledge Transfer
Yao Yuan, Wutao Liu, Pan Gao, Qun Dai, Jie Qin
6 min. talk | August 7th at 15:00 | Session: CV: Recognition (object detection, categorization)
Recently, unsupervised salient object detection (USOD) has gained increasing attention due to its annotation-free nature. However, current methods mainly focus on specific tasks such as RGB and RGB-D, neglecting the potential for task migration. In this paper, we propose a unified USOD framework for generic USOD tasks. Firstly, we propose a Progressive Curriculum Learning-based Saliency Distilling (PCL-SD) mechanism to extract saliency cues from a pre-trained deep network. This mechanism starts with easy samples and progressively moves towards harder ones, to avoid initial interference caused by hard samples. Afterwards, the obtained saliency cues are utilized to train a saliency detector, and we employ a Self-rectify Pseudo-label Refinement (SPR) mechanism to improve the quality of pseudo-labels. Finally, an adapter-tuning method is devised to transfer the acquired saliency knowledge, leveraging shared knowledge to attain superior transfer performance on the target tasks. Extensive experiments on five representative SOD tasks confirm the effectiveness and feasibility of our proposed method. Code and supplementary materials are available at https://github.com/I2-Multimedia-Lab/A2S-v3.
List of keywords
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Scene analysis and understanding   
Machine Learning -> ML: Unsupervised learning
2454
How to Learn Domain-Invariant Representations for Visual Reinforcement Learning: An Information-Theoretical Perspective
Shuo Wang, Zhihao Wu, Jinwen Wang, Xiaobo Hu, Youfang Lin, Kai Lv
6 min. talk | August 9th at 11:30 | Session: CV: Computer Vision (1/2)
Despite the impressive success in visual control challenges, Visual Reinforcement Learning (VRL) policies have struggled to generalize to other scenarios. Existing works attempt to improve the generalization capability empirically but lack theoretical support. In this work, we explore how to learn domain-invariant representations for VRL from an information-theoretical perspective. Specifically, we identify three Mutual Information (MI) terms. These terms highlight that a robust representation should preserve domain-invariant information (return and dynamic transition) under significant observation perturbation. Furthermore, we relax the MI terms to derive three components for implementing a practical Mutual Information-based Invariant Representation (MIIR) algorithm for VRL. Extensive experiments demonstrate that MIIR achieves state-of-the-art generalization performance and the best sample efficiency in the DeepMind Control suite, Robotic Manipulation, and CARLA.
List of keywords
Computer Vision -> CV: Embodied vision: Active agents, simulation
2462
Improved Parallel Algorithm for Non-Monotone Submodular Maximization under Knapsack Constraint
Tan D. Tran, Canh V. Pham, Dung T. K. Ha, Phuong N. H. Pham
6 min. talk | August 7th at 10:00 | Session: CSO: Constraint optimization problems
This work proposes an efficient parallel algorithm for non-monotone submodular maximization under a knapsack constraint over a ground set of size n. Our algorithm improves the best approximation factor of existing parallel algorithms from 8 to 7 with O(log n) adaptive complexity. The key idea of our approach is an alternate threshold algorithmic framework. This new strategy alternately constructs two disjoint candidate solutions within a constant number of sequential rounds. The algorithm then boosts solution quality without sacrificing adaptive complexity. Extensive experimental studies on three applications, Revenue Maximization, Image Summarization, and Maximum Weighted Cut, show that our algorithm not only significantly increases solution quality but also requires adaptivity comparable to state-of-the-art algorithms.
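For intuition about thresholding, here is a plain sequential density-threshold greedy on a toy (monotone) coverage instance; it only illustrates the threshold sweep, and the paper's low-adaptivity parallel construction of two alternating disjoint candidate solutions is not reproduced:

```python
def threshold_greedy(ground, f, cost, budget, eps=0.25):
    """Sweep a geometrically decreasing density threshold tau and add any
    element whose marginal gain per unit cost still meets it, while the
    knapsack budget allows. (Sequential baseline sketch only.)"""
    tau = max(f({e}) / cost(e) for e in ground)
    tau_min = eps * tau / len(ground)
    S, spent = set(), 0.0
    while tau > tau_min:
        for e in sorted(ground - S):
            gain = f(S | {e}) - f(S)
            if spent + cost(e) <= budget and gain / cost(e) >= tau:
                S.add(e)
                spent += cost(e)
        tau *= 1 - eps
    return S

# Toy weighted-coverage instance with unit costs and budget 2.
sets = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d", "e"}, 4: {"e"}}
f = lambda S: len(set().union(*(sets[i] for i in S))) if S else 0
print(threshold_greedy(set(sets), f, cost=lambda e: 1.0, budget=2.0))  # {1, 3}
```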
List of keywords
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Constraint Satisfaction and Optimization -> CSO: Applications
Constraint Satisfaction and Optimization -> CSO: Constraint learning and acquisition
Data Mining -> DM: Big data and scalability
2476
Revealing the Two Sides of Data Augmentation: An Asymmetric Distillation-based Win-Win Solution for Open-Set Recognition
Yunbing Jia, Xiaoyu Kong, Fan Tang, Yixing Gao, Weiming Dong, Yi Yang
6 min. talk | August 9th at 10:00 | Session: CV: Representation learning
In this paper, we reveal the two sides of data augmentation: gains in closed-set recognition correlate with a significant decrease in open-set recognition. Through empirical investigation, we find that multi-sample-based augmentations reduce feature discrimination, thereby diminishing open-set criteria. Although knowledge distillation could transfer discriminative features via imitation, mixed features with ambiguous semantics hinder the distillation. To this end, we propose an asymmetric distillation framework that feeds the teacher model extra raw data to enlarge the teacher’s benefit. Moreover, a joint mutual information loss and a selective relabel strategy are utilized to alleviate the influence of hard mixed samples. Our method successfully mitigates the decline in open-set recognition and outperforms SOTAs by 2%~3% AUROC on the Tiny-ImageNet dataset, and experiments on the large-scale ImageNet-21K dataset demonstrate the generalization of our method.
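The multi-sample augmentations in question are typified by mixup; the sketch below shows the semantically ambiguous mixed inputs and one hedged reading of the asymmetric setup, in which the teacher additionally receives the raw (unmixed) batch:

```python
import torch

def mixup(x, y, alpha=0.4):
    """Standard mixup: convex combinations of shuffled sample pairs. The
    mixed images carry ambiguous semantics, which (per the paper) reduces
    open-set feature discrimination."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], (y, y[idx], lam)

x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
x_mix, (y_a, y_b, lam) = mixup(x, y)

# Asymmetric distillation (one hedged reading): the student sees only the
# mixed batch, while the teacher also sees the raw batch so its targets
# stay semantically unambiguous.
teacher_in = torch.cat([x, x_mix])
student_in = x_mix
print(teacher_in.shape, student_in.shape)
```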
List of keywords
Computer Vision -> CV: Representation learning
Computer Vision -> CV: Structural and model-based approaches, knowledge representation and reasoning
2484
CMACE: CMAES-based Counterfactual Explanations for Black-box Models
Xudong Yin, Yao Yang
6 min. talk | August 6th at 11:30 | Session: ETF: Explainability and interpretability
Explainable Artificial Intelligence plays a vital role in machine learning, due to its widespread application in decision-making scenarios, e.g., credit lending. Counterfactual Explanation (CFE) is a kind of explanatory method that asks “what if”: what would have happened if model inputs had changed slightly. To answer this question, Counterfactual Explanation aims at finding a minimum perturbation of model inputs that leads to a different model decision. Compared with model-agnostic approaches, model-specific CFE approaches designed only for a specific type of model usually perform better at finding optimal counterfactual perturbations, owing to access to the inner workings of the models. To deal with this dilemma, this work proposes CMAES-based Counterfactual Explanations (CMACE): an effective model-agnostic counterfactual generation approach based on the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) and a warm-starting scheme that provides good initialization of the counterfactual’s mean and covariance parameters for CMA-ES, taking advantage of prior information from training samples. CMACE significantly outperforms another state-of-the-art (SOTA) model-agnostic approach (Bayesian Counterfactual Generator, BayCon) under various experimental settings. Extensive experiments also demonstrate that CMACE is superior to a SOTA model-specific approach (Flexible Optimizable Counterfactual Explanations for Tree Ensembles, FOCUS) that is designed for tree-based models using gradient-based optimization.
List of keywords
AI Ethics, Trust, Fairness -> ETF: Explainability and interpretability
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
Machine Learning -> ML: Optimization
Machine Learning -> ML: Trustworthy machine learning
2486
Learning Low-Rank Tensor Cores with Probabilistic ℓ0-Regularized Rank Selection for Model Compression
Tianxiao Cao, Lu Sun, Canh Hao Nguyen, Hiroshi Mamitsuka
6 min. talk | August 6th at 15:00 | Session: ML: Machine Learning (2/6)
Compressing deep neural networks is of great importance for real-world applications on resource-constrained devices. Tensor decomposition is one promising answer that retains the functionality and most of the expressive power of the original deep models by replacing the weights with their decomposed cores. Decomposition with optimal ranks can achieve a good compression-accuracy trade-off, but it is expensive to optimize due to its discrete and combinatorial nature. A common practice is to set all ranks equal and tune one hyperparameter, but this may significantly harm flexibility and generalization. In this paper, we propose a novel automatic rank selection method for deep model compression that allows learning model weights and decomposition ranks simultaneously. We propose to penalize the ℓ0 (quasi-)norm of the slices of decomposed tensor cores during model training. To avoid combinatorial optimization, we develop a probabilistic formulation and apply an approximate Bernoulli gate to each of the slices of the tensor cores, which can be implemented in an end-to-end and scalable framework via gradient descent. This enables automatic rank selection to be incorporated with arbitrary tensor decompositions and neural network layers such as linear layers, convolutional layers, and embedding layers. Comprehensive experiments on various tasks, including image classification, text sentiment classification, and neural machine translation, demonstrate the superior effectiveness of the proposed method over baselines.
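A minimal sketch of such a gate, assuming a straight-through Bernoulli relaxation; the paper's probabilistic formulation may use a different estimator, and the slice shapes and penalty weight here are illustrative:

```python
import torch
import torch.nn as nn

class SliceGate(nn.Module):
    """One Bernoulli gate per slice of a tensor core: the sum of open
    probabilities is the expected l0 penalty, and a straight-through
    estimator keeps the sampled 0/1 gates differentiable end-to-end."""

    def __init__(self, n_slices: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_slices))

    def forward(self):
        p = torch.sigmoid(self.logits)        # P(slice is kept)
        hard = torch.bernoulli(p)             # sampled 0/1 gates
        gate = hard + p - p.detach()          # straight-through gradient
        return gate, p.sum()                  # gates and expected l0 norm

core = torch.randn(16, 8, 8)                  # 16 rank-slices of one core
gate, expected_l0 = SliceGate(16)()
pruned = core * gate[:, None, None]           # closed slices are zeroed out
loss = pruned.square().mean() + 1e-3 * expected_l0   # task + l0 regularizer
loss.backward()
```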
List of keywords
Machine Learning -> ML: Matrix/tensor methods
Machine Learning -> ML: Learning sparse models
2495
A Complete Landscape of EFX Allocations on Graphs: Goods, Chores and Mixed Manna
Yu Zhou, Tianze Wei, Minming Li, Bo Li
6 min. talk | August 8th at 10:00 | Session: GTEP: Fair division
We study envy-free up to any item (EFX) allocations on graphs where vertices and edges represent agents and items respectively. An agent is only interested in items that are incident to her and all other items have zero marginal values to her. Christodoulou et al. first proposed this setting and studied the case of goods. We extend this setting to the case of mixed manna where an item may be liked or disliked by its endpoint agents. In our problem, an agent has an arbitrary valuation over her incident items such that the items she likes have non-negative marginal values to her and those she dislikes have non-positive marginal values. We provide a complete study of the four notions of EFX for mixed manna in the literature, which differ by whether the removed item can have zero marginal value. We prove that an allocation that satisfies the notion of EFX where the virtually-removed item could always have zero marginal value may not exist and determining its existence is NP-complete, while one that satisfies any of the other three notions always exists and can be computed in polynomial time. We also prove that an orientation (i.e., a special allocation where each edge must be allocated to one of its endpoint agents) that satisfies any of the four notions may not exist, and determining its existence is NP-complete.
List of keywords
Game Theory and Economic Paradigms -> GTEP: Fair division
Game Theory and Economic Paradigms -> GTEP: Computational social choice
2504
G2LTraj: A Global-to-Local Generation Approach for Trajectory Prediction
Zhanwei Zhang, Zishuo Hua, Minghao Chen, Wei Lu, Binbin Lin, Deng Cai, Wenxiao Wang
6 min. talk | August 7th at 10:00 | Session: DM: Mining spatial and/or temporal data (1/2)
Accurately predicting the future trajectories of traffic agents is of substantial importance in various applications such as autonomous driving. Previous methods commonly infer all future steps of an agent either recursively or simultaneously. However, the recursive strategy suffers from accumulated error, while the simultaneous strategy overlooks the constraints among future steps, resulting in kinematically infeasible predictions. To address these issues, in this paper, we propose G2LTraj, a plug-and-play global-to-local generation approach for trajectory prediction. Specifically, we generate a series of global key steps that uniformly cover the entire future time range. Subsequently, the local intermediate steps between adjacent key steps are recursively filled in. In this way, we prevent the accumulated error from propagating beyond the adjacent key steps. Moreover, to boost kinematic feasibility, we not only introduce spatial constraints among the key steps but also strengthen temporal constraints among the intermediate steps. Finally, to ensure the optimal granularity of the key steps, we design a selectable granularity strategy that caters to each predicted trajectory. G2LTraj significantly improves the performance of seven existing trajectory predictors across the ETH, UCY and nuScenes datasets. Experimental results demonstrate its effectiveness. Code will be available at https://github.com/Zhanwei-Z/G2LTraj.
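A toy numpy sketch of the global-to-local idea: key steps uniformly cover the horizon and the intermediate steps between adjacent key steps are filled in recursively, so a filler's error cannot propagate past its two enclosing key steps; plain midpoints stand in for G2LTraj's learned prediction heads:

```python
import numpy as np

def global_to_local(key_steps: np.ndarray, depth: int = 2) -> np.ndarray:
    """Recursively insert an intermediate step between each pair of
    neighboring steps (midpoints here; learned fillers in G2LTraj)."""
    traj = key_steps
    for _ in range(depth):
        mids = (traj[:-1] + traj[1:]) / 2          # local steps
        out = np.empty((len(traj) + len(mids), 2))
        out[0::2], out[1::2] = traj, mids          # interleave key/local
        traj = out
    return traj

# Four global key (x, y) steps uniformly covering the future horizon.
keys = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 0.5], [3.0, 1.5]])
print(global_to_local(keys).shape)                 # (13, 2): 4 keys + 9 fills
```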
List of keywords
Data Mining -> DM: Mining spatial and/or temporal data
Computer Vision -> CV: Motion and tracking
Machine Learning -> ML: Time series and data streams
2505
TFCD: Towards Multi-modal Sarcasm Detection via Training-Free Counterfactual Debiasing
Zhihong Zhu, Xianwei Zhuang, Yunyan Zhang, Derong Xu, Guimin Hu, Xian Wu, Yefeng Zheng
12 min. talk | August 7th at 10:00 | Session: NLP: Sentiment analysis, stylistic analysis, and argument mining
Multi-modal sarcasm detection (MSD), which aims to identify whether a given sample with multi-modal information (i.e., text and image) is sarcastic, has garnered widespread attention. Recent approaches focus on designing sophisticated architectures or mechanisms to extract sarcastic cues from entire or local image and text features. Nevertheless, a long-overlooked issue is that the current MSD task invariably suffers from unintended dataset biases, especially statistical label bias and sarcasmless word bias. Concretely, such harmful biases are confounders that may mislead existing models into learning spurious correlations, significantly limiting models’ performance. To tackle this issue, this paper proposes a Training-Free Counterfactual Debiasing framework, TFCD, which first formulates the causalities among variables in MSD via a tailored causal graph. Then, TFCD extracts biases from the conventionally-trained model by generating counterfactual utterances and contexts and mitigates them using element-wise subtraction. Extensive experiments on two benchmarks demonstrate the effectiveness of the proposed TFCD. Remarkably, TFCD requires neither data balancing nor model modifications, and can thus be seamlessly integrated into diverse state-of-the-art approaches to achieve considerable improvement margins.
List of keywords
Natural Language Processing -> NLP: Sentiment analysis, stylistic analysis, and argument mining
2519
Partial Optimal Transport Based Out-of-Distribution Detection for Open-Set Semi-Supervised Learning
Yilong Ren, Chuanwen Feng, Xike Xie, S. Kevin Zhou
6 min. talk | August 8th at 11:30 | Session: ML: Semi-supervised learning
Semi-supervised learning (SSL) is a machine learning paradigm that utilizes both labeled and unlabeled data to enhance the performance of learning tasks. However, SSL methods operate under the assumption that the label spaces of the labeled and unlabeled data are identical, which may not hold in open-world applications. In such scenarios, the unlabeled data may contain novel categories that are not present in the labeled training data, i.e., outliers. This specific challenge is referred to as the Open-set Semi-supervised Learning (OSSL) problem. In OSSL, a pivotal concern is the detection of out-of-distribution (OOD) samples within the unlabeled data. Existing methods often struggle to provide effective OOD detection strategies, especially when dealing with datasets comprising a large number of training categories. In response to this challenge, we model the OOD detection problem in OSSL as a partial optimal transport (POT) problem. With POT theory, we devise a mass score function to measure the likelihood of a sample being an outlier, which enables a binary classifier for OOD detection. Further, we put forward an OOD loss, enabling the seamless integration of the binary classifier and off-the-shelf SSL methods under OSSL settings, all within an end-to-end training framework. We extensively evaluate our proposal under various datasets and OSSL configurations, consistently demonstrating its superior performance. Codes are available at https://github.com/ryl0427/Code_for_POT_OSSL.
List of keywords
Machine Learning -> ML: Semi-supervised learning
Machine Learning -> ML: Optimization
Machine Learning -> ML: Robustness
2536
MMVQA: A Comprehensive Dataset for Investigating Multipage Multimodal Information Retrieval in PDF-based Visual Question Answering
Yihao Ding, Kaixuan Ren, Jiabin Huang, Siwen Luo, Soyeon Caren Han
6 min. talk | August 6th at 11:30 | Session: ML: Multi-modal learning
Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD), particularly those with lengthy textual content. Existing studies primarily focus on real-world documents with sparse text, while challenges persist in comprehending the hierarchical semantic relations among multiple pages to locate multimodal components. This paper introduces PDF-MVQA, tailored for research journal articles, encompassing multiple pages and multimodal retrieval. Our approach aims to retrieve entire paragraphs containing answers, or visually rich document entities such as tables and figures. The main contribution is a comprehensive PDF Document VQA dataset that allows the examination of semantically hierarchical layout structures in text-dominant documents. We also present new VRD-QA frameworks that grasp textual contents and relations among document layouts simultaneously, extending page-level understanding to the entire multi-page document. We aim to enhance the capabilities of existing vision-and-language models in handling the challenges posed by text-dominant documents in VRD-QA. Code and appendix are available at https://github.com/adlnlp/pdfmvqa.
List of keywords
Natural Language Processing -> NLP: Applications
Natural Language Processing -> NLP: Resources and evaluation
Natural Language Processing -> NLP: Information extraction
2537
A Deep Probabilistic Spatiotemporal Framework for Dynamic Graph Representation Learning with Application to Brain Disorder Identification
Sin-Yee Yap, Junn Yong Loo, Chee-Ming Ting, Fuad Noman, Raphaël C.-W. Phan, Adeel Razi, David L. Dowe
6 min. talk | August 6th at 15:00 | Session: ML: Machine Learning (2/6)
Recent applications of pattern recognition techniques to brain connectome classification using functional connectivity (FC) are shifting towards acknowledging the non-Euclidean topology and dynamic aspects of brain connectivity across time. In this paper, a deep spatiotemporal variational Bayes (DSVB) framework is proposed to learn time-varying topological structures in dynamic FC networks for identifying autism spectrum disorder (ASD) in human participants. The framework incorporates a spatial-aware recurrent neural network with an attention-based message-passing scheme to capture rich spatiotemporal patterns across dynamic FC networks. To overcome model overfitting on limited training datasets, an adversarial training strategy is introduced to learn graph embedding models that generalize well to unseen brain networks. Evaluation on the ABIDE resting-state functional magnetic resonance imaging dataset shows that our proposed framework substantially outperforms state-of-the-art methods in identifying patients with ASD. Dynamic FC analyses with DSVB-learned embeddings reveal apparent group differences between ASD and healthy controls in brain network connectivity patterns and switching dynamics of brain states.
List of keywords
Machine Learning -> ML: Learning graphical models
Machine Learning -> ML: Probabilistic machine learning
Machine Learning -> ML: Adversarial machine learning
Multidisciplinary Topics and Applications -> MTA: Health and medicine
2540
MARS: Multimodal Active Robotic Sensing for Articulated Characterization
Hongliang Zeng, Ping Zhang, Chengjiong Wu, Jiahua Wang, Tingyu Ye, Fang Li
6 min. talk | August 6th at 11:30 | Session: CV: 3D computer vision (1/2)
Precise perception of articulated objects is vital for empowering service robots. Recent studies mainly focus on point clouds, a single-modal approach that often neglects vital texture and lighting details and assumes ideal conditions such as optimal viewpoints, which is unrepresentative of real-world scenarios. To address these limitations, we introduce MARS, a novel framework for articulated object characterization. It features a multi-modal fusion module utilizing multi-scale RGB features to enhance point cloud features, coupled with reinforcement learning-based active sensing for autonomous optimization of observation viewpoints. In experiments conducted with various articulated object instances from the PartNet-Mobility dataset, our method outperformed current state-of-the-art methods in joint parameter estimation accuracy. Additionally, through active sensing, MARS further reduces errors, demonstrating enhanced efficiency in handling suboptimal viewpoints. Furthermore, our method effectively generalizes to real-world articulated objects, enhancing robot interactions. Code is available at https://github.com/robhlzeng/MARS.
List of keywords
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Multimodal learning
Robotics -> ROB: Perception
Robotics -> ROB: Manipulation
2565
Graph Attention Network with High-Order Neighbor Information Propagation for Social Recommendation
Fei Xiong, Haoran Sun, Guixun Luo, Shirui Pan, Meikang Qiu, Liang Wang
6 min. talk | August 9th at 11:30 | Session: DM: Mining graphs (3/3)
In recommender systems, graph neural networks (GNN) can integrate interactions between users and items with their attributes, which makes GNN-based methods more powerful. However, directly stacking multiple layers in a graph neural network can easily lead to over-smoothing, hence recommendation systems based on graph neural networks typically underutilize higher-order neighborhoods in their learning. Although some heterogeneous graph random walk methods based on meta-paths can achieve higher-order aggregation, their focus is predominantly on the nodes at the ends of the paths. Moreover, these methods require manually defined meta-paths, which limits the model’s expressiveness and flexibility. Furthermore, path encoding in graph neural networks usually focuses only on the sequence leading to the target node. However, real-world interactions often do not follow this strict sequence, limiting the predictive performance of sequence-based network models. These problems prevent GNN-based methods from being fully effective. We propose a Graph Attention network with Information Propagation path aggregation for Social Recommendation (GAIPSRec). Firstly, we propose a universal heterogeneous graph sampling framework that does not require manually defined meta-paths for path sampling, thereby offering greater flexibility. Moreover, our method takes into account all nodes on the aggregation path and is capable of learning information from higher-order neighbors without succumbing to over-smoothing. Finally, our method utilizes a gate mechanism to fuse sequential and non-sequential dependencies in encoding path instances, allowing a more holistic view of the data. Extensive experiments on real-world datasets show that our proposed GAIPSRec improves performance significantly and outperforms state-of-the-art methods.
List of keywords
Data Mining -> DM: Mining graphs
Data Mining -> DM: Recommender systems
2594
C3L: Content Correlated Vision-Language Instruction Tuning Data Generation via Contrastive Learning
Ji Ma, Wei Suo, Peng Wang, Yanning Zhang
6 min. talk | August 8th at 15:00 | Session: CV: Computer Vision (2/2)
Vision-Language Instruction Tuning (VLIT) is a critical training phase for Large Vision-Language Models (LVLMs). With the improving capabilities of open-source LVLMs, researchers have increasingly turned to generating VLIT data using open-source LVLMs and have achieved significant progress. However, such data generation approaches are bottlenecked by the following challenges: 1) since multi-modal models tend to be influenced by prior language knowledge, directly using LVLMs to generate VLIT data inevitably leads to low content relevance between the generated data and the images; 2) to improve the ability of the models to generate VLIT data, previous methods have incorporated an additional training phase to boost the generative capacity, which hurts the generalization of the models to unseen inputs (i.e., the “exposure bias” problem). In this paper, we propose a new approach: Content Correlated VLIT data generation via Contrastive Learning (C3L). Specifically, we design a new content relevance module which enhances the content relevance between VLIT data and images by computing Image Instruction Correspondence Scores S(I2C). Moreover, a contrastive learning module is introduced to further boost the VLIT data generation capability of the LVLMs. Extensive automatic evaluations on four benchmarks show the effectiveness of our method.
List of keywords
Computer Vision -> CV: Vision, language and reasoning
Computer Vision -> CV: Multimodal learning
Natural Language Processing -> NLP: Language models
2608
FineFMPL: Fine-grained Feature Mining Prompt Learning for Few-Shot Class Incremental Learning
Hongbo Sun, Jiahuan Zhou, Xiangteng He, Jinglin Xu, Yuxin Peng
6 min. talk | August 8th at 15:00 | Session: CV: Computer Vision (2/2)
Few-Shot Class Incremental Learning (FSCIL) aims to continually learn new classes from few training samples without forgetting already learned old classes. Existing FSCIL methods generally fix the backbone network in incremental sessions to strike a balance between suppressing the forgetting of old classes and learning new ones. However, the fixed backbone network causes insufficient learning of new classes from a few samples. Benefiting from the powerful visual and textual understanding of Vision-Language (VL) pre-training models, we propose a Fine-grained Feature Mining Prompt Learning (FineFMPL) approach that adapts the VL model to FSCIL and comprehensively learns and memorizes fine-grained discriminative information of emerging classes. Concretely, a visual probe prompt is first proposed to guide the image encoder of the VL model to extract global-level coarse-grained features and object-level fine-grained features, and visual prototypes are preserved based on image patch significance, capturing the discriminative characteristics exclusive to each class. Second, a textual context prompt is constructed by cross-modal mapping of the visual prototypes and fed into the text encoder of the VL model to memorize class information as textual prototypes. Finally, integrating the visual and textual prototypes based on fine-grained feature mining into the model improves recognition performance on all classes in FSCIL. Extensive experiments on three benchmark datasets demonstrate that FineFMPL achieves a new state of the art. The code is available at https://github.com/PKU-ICST-MIPL/FineFMPL_IJCAI2024.
List of keywords
Computer Vision -> CV: Recognition (object detection, categorization)
2616
Joint Multimodal Aspect Sentiment Analysis with Aspect Enhancement and Syntactic Adaptive Learning
Linlin Zhu, Heli Sun, Qunshu Gao, Tingzhou Yi, Liang He
6 min. talk | August 7th at 10:00 | Session: NLP: Sentiment analysis, stylistic analysis, and argument mining
As an important task in sentiment analysis, joint multimodal aspect sentiment analysis (JMASA) has received increasing attention in recent years. However, previous approaches either i) directly fuse multimodal data without fully exploiting the correlation between multimodal input data, or ii) equally utilize the dependencies of words in the text for sentiment analysis, ignoring the differences in the importance of different words. To address these limitations, we propose a joint multimodal sentiment analysis method based on Aspect Enhancement and Syntactic Adaptive Learning (AESAL). Specifically, we construct an aspect enhancement pre-training task to enable the model to fully learn the correlation of aspects between multimodal input data. In order to capture the differences in the importance of different words in the text, we design a syntactic adaptive learning mechanism. First, we construct different syntactic dependency graphs based on the distance between words to learn global and local information in the text. Second, we use a multi-channel adaptive graph convolutional network to maintain the uniqueness of each modality while fusing the correlations between different modalities. Experimental results on benchmark datasets show that our method outperforms state-of-the-art methods.
List of keywords
Natural Language Processing -> NLP: Sentiment analysis, stylistic analysis, and argument mining
Computer Vision -> CV: Multimodal learning
2617
Pluggable Watermarking of Deepfake Models for Deepfake Detection
Han Bao, Xuhong Zhang, Qinying Wang, Kangming Liang, Zonghui Wang, Shouling Ji, Wenzhi Chen
6 min. talk | August 9th at 10:00 | Session: ETF: Trustworthy AI
Deepfake model misuse poses major security concerns. Existing passive and active Deepfake detection methods both suffer from a lack of generalizability and robustness. In this study, we propose a pluggable and efficient active model watermarking framework for Deepfake detection. This approach facilitates the embedding of identification watermarks across a variety of Deepfake generation models, enabling easy extraction by authorities for detection purposes. Specifically, our method leverages the universal convolutional structure in generative model decoders. It employs convolutional kernel sparsification to adaptively position the embedded watermark and introduces convolutional kernel normalization to seamlessly integrate the watermark parameters with those of the generative model. For watermark extraction, we jointly train a watermark extractor based on a Deepfake detection model and use BCH encoding to identify watermark images effectively. Finally, we apply our approach to eight major types of Deepfake generation models. Experiments show our method successfully detects Deepfakes with an average accuracy exceeding 94%, even over heavily lossy channels. The approach operates independently of the generation model’s training and does not affect the original model’s performance. Furthermore, it requires training only a very limited number of parameters, and it is resilient against three major adaptive attacks. The source code can be found at https://github.com/GuaiZao/Pluggable-Watermarking.
List of keywords
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
2622
PEACH: Pretrained-Embedding Explanation across Contextual and Hierarchical Structure
Feiqi Cao, Soyeon Caren Han, Hyunsuk Chung
12 min. talk | August 7th at 15:00 | Session: NLP: Natural Language Processing (2/3)
In this work, we propose a novel tree-based explanation technique, PEACH (Pretrained-embedding Explanation Across Contextual and Hierarchical Structure), which explains how text-based documents are classified by using any pretrained contextual embeddings in a tree-based, human-interpretable manner. Note that PEACH can adopt any contextual embeddings of PLMs as training input for the decision tree. Using the proposed PEACH, we perform a comprehensive analysis of several contextual embeddings on nine NLP text classification benchmarks. This analysis demonstrates the flexibility of the model by applying several PLM contextual embeddings, attribute selections, scaling, and clustering methods. Furthermore, we show the utility of the explanations by visualizing feature selection and important trends in text classification via human-interpretable word-cloud-based trees, which clearly identify model mistakes and assist in dataset debugging. Besides interpretability, PEACH’s classification performance matches or exceeds that of the pretrained models. Code and appendix are available at https://github.com/adlnlp/peach.
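The pipeline the abstract describes, fitting an interpretable decision tree on top of frozen contextual embeddings, can be sketched in a few lines; the random features below merely stand in for pooled PLM vectors, and the hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-in for pretrained contextual embeddings: in PEACH these would
# come from a PLM (e.g., mean-pooled BERT vectors); random data is used
# here only to illustrate the tree-on-embeddings pipeline.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 32))
y_train = rng.integers(0, 2, size=200)

tree = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
print(export_text(tree))  # human-readable splits over embedding features
```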
List of keywords
Natural Language Processing -> NLP: Interpretability and analysis of models for NLP
Knowledge Representation and Reasoning -> KRR: Other
2656
An Efficient Prototype-Based Clustering Approach for Edge Pruning in Graph Neural Networks to Battle Over-Smoothing
Yuyang Huang, Wenjing Lu, Yang Yang
6 min. talk | August 8th at 15:00 | Session: ML: Sequence and graph learning
Topology augmentation is a popular strategy to address the issue of over-smoothing in graph neural networks (GNNs). To prevent potential distortion of node representations, an essential principle is to enhance the separability between embeddings of nodes from different classes while preserving smoothness among nodes of the same class. However, differentiating between inter-class and intra-class edges becomes arduous when class labels are unavailable or the graph is partially labeled. While clustering offers an alternative for identifying closely connected groups of nodes, traditional clustering methods face challenges when applied to GNNs in terms of accuracy, efficiency, adaptability, and scalability to diverse graphs. To address these limitations, we introduce ClusterDrop, which uses learnable prototypes for efficient clustering and incorporates supervised signals to enhance accuracy and adaptability across different graphs. Experiments on six datasets with varying graph structures demonstrate its effectiveness in alleviating over-smoothing and enhancing GNN performance.
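To make the prototype-based pruning idea concrete, here is a tiny sketch: nodes are assigned to their nearest prototype, and edges crossing cluster boundaries are dropped. In ClusterDrop the prototypes are learnable and trained with supervised signals; the fixed random arrays below are assumptions for illustration.

```python
import numpy as np

def cluster_drop_sketch(x, edges, prototypes):
    # x: (n, d) node embeddings; prototypes: (k, d); edges: list of (u, v).
    # Assign each node to its nearest prototype (squared Euclidean distance).
    dists = ((x[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    assign = np.argmin(dists, axis=1)
    # Keep only intra-cluster edges; cross-cluster edges are pruned.
    return [(u, v) for u, v in edges if assign[u] == assign[v]]

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
prototypes = rng.normal(size=(2, 8))
print(cluster_drop_sketch(x, [(0, 1), (1, 2), (3, 4)], prototypes))
```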
List of keywords
Machine Learning -> ML: Sequence and graph learning
Data Mining -> DM: Mining graphs
Data Mining -> DM: Networks
2666
Incorporating Schema-Aware Description into Document-Level Event Extraction
Zijie Xu, Peng Wang, Wenjun Ke, Guozheng Li, Jiajun Liu, Ke Ji, Xiye Chen, Chenxiao Wu
6 min. talk | August 7th at 11:30 | Session: NLP: Information extraction
Document-level event extraction (DEE) aims to extract the structured event information from a given document, facing two critical challenges: (1) event arguments always scatter across sentences (arguments-scattering); (2) multiple events can co-occur in one document (multi-event). Most recent studies mainly follow two simplified settings to ease the challenges: one simplifies DEE with the no-trigger-words design (NDEE), and the other focuses on event argument extraction (DEAE), a sub-task of DEE. However, the former excludes trigger extraction and suffers from error propagation in the sub-tasks. The latter relies heavily on the gold triggers as prerequisites and struggles to distinguish multiple arguments playing the same role in different events. To address the limitations above, we propose a novel joint trigger and argument extraction paradigm SEELE to enhance the DEE model via incorporating SchEma-awarE descriptions into Document-Level Event extraction. Specifically, the schema-aware descriptions are leveraged from two aspects: (1) guiding the attention mechanism among event-aware tokens across sentences, which relieves arguments-scattering without error propagation; (2) performing the fine-grained contrastive learning to distinguish different events, which mitigates multi-event without gold triggers. Extensive experiments show the superiority of SEELE, achieving notable improvements (2.1% to 9.7% F1) on three NDEE datasets and competitive performance on two DEAE datasets. Our code is available at https://github.com/TheoryRhapsody/SEELE.
List of keywords
Natural Language Processing -> NLP: Information extraction
2669
Bridging the Gap between General and Down-Closed Convex Sets in Submodular Maximization
Loay Mualem, Murad Tukan, Moran Feldman
12 min. talk | August 7th at 10:00 | Session: CSO: Constraint optimization problems
Optimization of DR-submodular functions has experienced a notable surge in significance in recent times, marking a pivotal development within the domain of non-convex optimization. Motivated by real-world scenarios, some recent works have delved into the maximization of non-monotone DR-submodular functions over general (not necessarily down-closed) convex set constraints. Up to this point, these works have all used the minimum L-infinity norm of any feasible solution as a parameter. Unfortunately, a recent hardness result due to Mualem and Feldman shows that this approach cannot yield a smooth interpolation between down-closed and non-down-closed constraints. In this work, we suggest novel offline and online algorithms that provably provide such an interpolation based on a natural decomposition of the convex body constraint into two distinct convex bodies: a down-closed convex body and a general convex body. We also empirically demonstrate the superiority of our proposed algorithms across three offline and two online applications.
List of keywords
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Machine Learning -> ML: Online learning
Search -> S: Combinatorial search and optimisation
2671
CLIP-FSAC: Boosting CLIP for Few-Shot Anomaly Classification with Synthetic Anomalies
Zuo Zuo, Yao Wu, Baoqiang Li, Jiahao Dong, You Zhou, Lei Zhou, Yanyun Qu, Zongze Wu
6 min. talk | August 6th at 15:00 | Session: CV: Transfer, low-shot, semi- and un- supervised learning   
Few-shot anomaly classification (FSAC) is a vital task in the manufacturing industry. Recent methods focus on utilizing CLIP for zero/few-normal-shot anomaly detection instead of custom models. However, anomaly classification lacks task-specific text prompts, and most methods ignore the modality gap between image and text. Meanwhile, there is a distribution discrepancy between the pre-training data and the target data. To provide a remedy, in this paper we propose a method to boost CLIP for few-normal-shot anomaly classification, dubbed CLIP-FSAC, which comprises two training stages that alternately fine-tune two modality-specific adapters. Specifically, in the first stage, we train the image adapter with the text representations output by the text encoder and introduce image-to-text tuning to enhance multi-modal interaction and facilitate a better language-compatible visual representation. In the second stage, we freeze the image adapter and train the text adapter. Both are constrained by a fusion-text contrastive loss. Comprehensive experimental results are provided for evaluating our method on few-normal-shot anomaly classification, where it outperforms the state-of-the-art method by 12.2%, 10.9%, and 10.4% AUROC on VisA in the 1-, 2-, and 4-shot settings, respectively.
List of keywords
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Recognition (object detection, categorization)
2674
Learning Big Logical Rules by Joining Small Rules
Céline Hocquette, Andreas Niskanen, Rolf Morel, Matti Järvisalo, Andrew Cropper
6 min. talk | August 7th at 11:30 | Session: KRR: Logic programming
A major challenge in inductive logic programming is learning big rules. To address this challenge, we introduce an approach where we join small rules to learn big rules. We implement our approach in a constraint-driven system and use constraint solvers to efficiently join rules. Our experiments on many domains, including game playing and drug design, show that our approach can (i) learn rules with more than 100 literals, and (ii) drastically outperform existing approaches in terms of predictive accuracies.
List of keywords
Knowledge Representation and Reasoning -> KRR: Logic programming
Machine Learning -> ML: Symbolic methods
2677
Learning Logic Programs by Discovering Higher-Order Abstractions
Céline Hocquette, Sebastijan Dumancic, Andrew Cropper
6 min. talk | August 7th at 11:30 | Session: KRR: Logic programming
We introduce the higher-order refactoring problem, where the goal is to compress a logic program by discovering higher-order abstractions, such as map, filter, and fold. We implement our approach in Stevie, which formulates the refactoring problem as a constraint optimisation problem. Our experiments on multiple domains, including program synthesis and visual reasoning, show that refactoring can improve the learning performance of an inductive logic programming system, specifically improving predictive accuracies by 27% and reducing learning times by 47%. We also show that Stevie can discover abstractions that transfer to multiple domains.
List of keywords
Knowledge Representation and Reasoning -> KRR: Logic programming
Machine Learning -> ML: Symbolic methods
2679
LitE-SNN: Designing Lightweight and Efficient Spiking Neural Network through Spatial-Temporal Compressive Network Search and Joint Optimization
Qianhui Liu, Jiaqi Yan, Malu Zhang, Gang Pan, Haizhou Li
6 min. talk | August 8th at 11:30 | Session: HAI: Cognitive modeling
Spiking Neural Networks (SNNs) mimic the information-processing mechanisms of the human brain and are highly energy-efficient, making them well-suited for low-power edge devices. However, the pursuit of accuracy in current studies leads to large, long-timestep SNNs, conflicting with the resource constraints of these devices. In order to design lightweight and efficient SNNs, we propose a new approach named LitE-SNN that incorporates both spatial and temporal compression into the automated network design process. Spatially, we present a novel Compressive Convolution block (CompConv) to expand the search space to support pruning and mixed-precision quantization. Temporally, we are the first to propose a compressive timestep search to identify the optimal number of timesteps under specific computation cost constraints. Finally, we formulate a joint optimization to simultaneously learn the architecture parameters and spatial-temporal compression strategies to achieve high performance while minimizing memory and computation costs. Experimental results on the CIFAR-10, CIFAR-100, and Google Speech Command datasets demonstrate that our proposed LitE-SNNs can achieve competitive or even higher accuracy with remarkably smaller model sizes and lower computation costs.
List of keywords
Humans and AI -> HAI: Cognitive modeling
Humans and AI -> HAI: Cognitive systems
2681
Theoretical Study on Multi-objective Heuristic Search
Shawn Skyler, Shahaf Shperberg, Dor Atzmon, Ariel Felner, Oren Salzman, Shao-Hung Chan, Han Zhang, Sven Koenig, William Yeoh, Carlos Hernandez Ulloa
12 min. talk | August 8th at 10:00 | Session: S: Search
This paper provides a theoretical study on Multi-Objective Heuristic Search. We first classify states in the state space into must-expand, maybe-expand, and never-expand states and then transfer these definitions to nodes in the search tree. We then formalize a framework that generalizes A* to Multi-Objective Search. We study different ways to order nodes under this framework and their relation to traditional tie-breaking policies and provide theoretical findings. Finally, we study and empirically compare different ordering functions.
List of keywords
Search -> S: Heuristic search
Search -> S: Other
Search -> General
2684
Cross-Domain Feature Augmentation for Domain Generalization
Yingnan Liu, Yingtian Zou, Rui Qiao, Fusheng Liu, Mong Li Lee, Wynne Hsu
6 min. talk | August 6th at 15:00 | Session: CV: Transfer, low-shot, semi- and un- supervised learning   
Domain generalization aims to develop models that are robust to distribution shifts. Existing methods focus on learning invariance across domains to enhance model robustness, and data augmentation has been widely used to learn invariant predictors, with most methods performing augmentation in the input space. However, augmentation in the input space offers limited diversity, whereas augmentation in the feature space is more versatile and has shown promising results. Nonetheless, feature semantics is seldom considered, and existing feature augmentation methods suffer from a limited variety of augmented features. We decompose features into class-generic, class-specific, domain-generic, and domain-specific components. We propose a cross-domain feature augmentation method named XDomainMix that enables us to increase sample diversity while emphasizing the learning of invariant representations to achieve domain generalization. Experiments on widely used benchmark datasets demonstrate that our proposed method achieves state-of-the-art performance. Quantitative analysis indicates that our feature augmentation approach facilitates the learning of effective models that are invariant across domains.
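A minimal sketch of the cross-domain mixing idea: keep a feature's class-related component and interpolate its domain-related component with one drawn from another domain. The four-way decomposition itself is assumed to have happened upstream, and the interpolation form is an illustrative assumption rather than XDomainMix's exact operator.

```python
import torch

def xdomain_mix_sketch(class_part: torch.Tensor,
                       domain_part_a: torch.Tensor,
                       domain_part_b: torch.Tensor,
                       lam: float = 0.5) -> torch.Tensor:
    # Preserve class semantics; vary only the domain-related component.
    mixed_domain = lam * domain_part_a + (1.0 - lam) * domain_part_b
    return class_part + mixed_domain

# Augment a batch of 4 features of width 128 with another domain's style.
f = xdomain_mix_sketch(torch.randn(4, 128), torch.randn(4, 128),
                       torch.randn(4, 128), lam=0.7)
```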
List of keywords
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
Computer Vision -> CV: Machine learning for vision
2699
Semantics for Non-Flat Assumption-Based Argumentation, Revisited
Jesse Heyninck, Ofer Arieli
6 min. talk | August 6th at 15:00 | Session: KRR: Argumentation
Assumption-based argumentation (ABA) is an argumentative formalism that allows for reasoning on the basis of defeasible assumptions and strict rules. Standard semantics for this formalism sometimes give rise to problematic behaviour in the presence of rules with assumptions in their heads. In this paper, we introduce a six-valued labelling semantics that overcomes these shortcomings while preserving all the usual properties of the standard Dung-style three-valued semantics for ABA frameworks, including existence of the complete semantics, uniqueness of the grounded semantics and preservation of the computational complexity of all main reasoning processes.
List of keywords
Knowledge Representation and Reasoning -> KRR: Argumentation
Knowledge Representation and Reasoning -> KRR: Non-monotonic reasoning
2704
CompetEvo: Towards Morphological Evolution from Competition
Kangyao Huang, Di Guo, Xinyu Zhang, Xiangyang Ji, Huaping Liu
6 min. talk | August 9th at 11:30 | Session: MAS: Agent-based and Multi-agent Systems (2/2)
Training an agent to adapt to specific tasks through co-optimization of morphology and control has attracted wide attention. However, whether an optimal configuration and tactics exist for agents in a multi-agent competition scenario remains an open question. In this context, we propose competitive evolution (CompetEvo), which co-evolves agents’ designs and tactics in confrontation. We build arenas consisting of three animals and their evolved derivatives, placing agents with different morphologies in direct competition with each other. The results reveal that our method enables agents to evolve designs and strategies better suited for fighting than those of fixed-morph agents, giving them an advantage in combat scenarios. Moreover, we demonstrate the striking behaviors that emerge when confrontations are conducted between agents with asymmetrical morphs.
List of keywords
Agent-based and Multi-agent Systems -> MAS: Agent-based simulation and emergence
Robotics -> ROB: Learning in robotics
Search -> S: Evolutionary computation
2707
TIM: An Efficient Temporal Interaction Module for Spiking Transformer
Sicheng Shen, Dongcheng Zhao, Guobin Shen, Yi Zeng
6 min. talk | August 8th at 11:30 | Session: HAI: Cognitive modeling
Spiking Neural Networks (SNNs), as the third generation of neural networks, have gained prominence for their biological plausibility and computational efficiency, especially in processing diverse datasets. The integration of attention mechanisms, inspired by advancements in neural network architectures, has led to the development of Spiking Transformers. These have shown promise in enhancing SNNs’ capabilities, particularly in the realms of both static and neuromorphic datasets. Despite their progress, a discernible gap exists in these systems, specifically in the Spiking Self Attention (SSA) mechanism’s effectiveness in leveraging the temporal processing potential of SNNs. To address this, we introduce the Temporal Interaction Module (TIM), a novel, convolution-based enhancement designed to augment the temporal data processing abilities within SNN architectures. TIM’s integration into existing SNN frameworks is seamless and efficient, requiring minimal additional parameters while significantly boosting their temporal information handling capabilities. Through rigorous experimentation, TIM has demonstrated its effectiveness in exploiting temporal information, leading to state-of-the-art performance across various neuromorphic datasets. The code is available at https://github.com/BrainCog-X/Brain-Cog/tree/main/examples/TIM.
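A minimal sketch of a convolution-based temporal interaction step, assuming features laid out as (batch, channels, time); the kernel size and residual form are illustrative assumptions rather than TIM's actual configuration.

```python
import torch
import torch.nn as nn

class TemporalInteractionSketch(nn.Module):
    """Causal 1-D convolution over the timestep axis, added residually so
    each timestep's features interact with a summary of earlier ones."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size - 1  # pad left-heavy, then crop the right
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=self.pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, time); cropping keeps the convolution causal.
        return x + self.conv(x)[..., : x.shape[-1]]

tim = TemporalInteractionSketch(dim=32)
y = tim(torch.randn(2, 32, 10))  # -> (2, 32, 10)
```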
List of keywords
Humans and AI -> HAI: Cognitive modeling
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Representation learning
Humans and AI -> HAI: Cognitive systems
2710
Learning Embeddings for Sequential Tasks Using Population of Agents
Mridul Mahajan, Georgios Tzannetos, Goran Radanovic, Adish Singla
12 min. talk | August 7th at 15:00 | Session: ML: Reinforcement learning (1/2)
We present an information-theoretic framework to learn fixed-dimensional embeddings for tasks in reinforcement learning. We leverage the idea that two tasks are similar if observing an agent’s performance on one task reduces our uncertainty about its performance on the other. This intuition is captured by our information-theoretic criterion which uses a diverse agent population as an approximation for the space of agents to measure similarity between tasks in sequential decision-making settings. In addition to qualitative assessment, we empirically demonstrate the effectiveness of our techniques based on task embeddings by quantitative comparisons against strong baselines on two application scenarios: predicting an agent’s performance on a new task by observing its performance on a small quiz of tasks, and selecting tasks with desired characteristics from a given set of options.
List of keywords
Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Multi-task and transfer learning
Machine Learning -> ML: Representation learning
2717
MOSER: Learning Sensory Policy for Task-specific Viewpoint via View-conditional World Model
Shenghua Wan, Hai-Hang Sun, Le Gan, De-Chuan Zhan
6 min. talk | August 7th at 15:00 | Session: ML: Reinforcement learning (1/2)
Reinforcement learning from visual observations is a challenging problem with many real-world applications. Existing algorithms mostly rely on a single observation from a well-designed fixed camera that requires human knowledge. Recent studies learn from different viewpoints with multiple fixed cameras, but this incurs high computation and storage costs and may not guarantee the coverage of the optimal viewpoint. To alleviate these limitations, we propose a straightforward View-conditional Partially Observable Markov Decision Processes (VPOMDPs) assumption and develop a new method, the MOdel-based SEnsor controlleR (MOSER). MOSER jointly learns a view-conditional world model (VWM) to simulate the environment, a sensory policy to control the camera, and a motor policy to complete tasks. We design intrinsic rewards from the VWM without additional modules to guide the sensory policy to adjust the camera parameters. Experiments on locomotion and manipulation tasks demonstrate that MOSER autonomously discovers task-specific viewpoints and significantly outperforms most baseline methods.
List of keywords
Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Model-based and model learning reinforcement learning
Machine Learning -> ML: Partially observable reinforcement learning and POMDPs
2718
FBLG: A Local Graph Based Approach for Handling Dual Skewed Non-IID Data in Federated Learning
Yi Xu, Ying Li, Haoyu Luo, Xiaoliang Fan, Xiao Liu
12 min. talk | August 6th at 15:00 | Session: ML: Federated learning (1/2)
In real-world situations, federated learning often needs to process non-IID (non-independent and identically distributed) data with multiple skews, causing inadequate model performance. Existing federated learning methods mainly address a single skew of non-IID data, and hence the performance of global models can degrade when faced with dual-skewed non-IID data caused by heterogeneous label distributions and sample sizes among clients. To address the problem of dual-skewed non-IID data, in this paper we propose a federated learning algorithm based on local graphs, named FBLG. Specifically, to address the label distribution skew, we first construct a local graph based on clients’ local losses and Jensen-Shannon (JS) divergence, so that similar clients can be selected for aggregation to ensure a highly consistent global model. Afterwards, to address the sample size skew, we design the objective function to favor clients with more samples, as models trained with more samples tend to carry more useful information. Experiments on four datasets with dual-skewed non-IID data demonstrate that FBLG outperforms nine baseline methods and achieves up to a 9% improvement in accuracy. Both theoretical analysis and experiments further show that FBLG converges quickly.
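The JS-divergence side of the local-graph construction can be pictured as follows: connect clients whose label distributions are close. The full method also factors in local losses, and the threshold below is an illustrative assumption.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def client_similarity_graph(label_dists, threshold=0.3):
    """Connect clients whose label distributions have small JS divergence,
    a sketch of the graph-building step described in the FBLG abstract."""
    n = len(label_dists)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # scipy returns the JS *distance* (square root of the divergence).
            if jensenshannon(label_dists[i], label_dists[j]) ** 2 < threshold:
                edges.append((i, j))
    return edges

dists = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]])
print(client_similarity_graph(dists))  # clients 0 and 1 end up connected
```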
List of keywords
Machine Learning -> ML: Federated learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Optimization
Machine Learning -> ML: Supervised Learning
2729
Reconfigurability-Aware Selection for Contrastive Active Domain Adaptation
Zeyu Zhang, Chun Shen, Shuai Lü, Shaojie Zhang
6 min. talk | August 7th at 15:00 | Session: ML: Machine Learning (3/6)
Active domain adaptation (ADA) aims to label a small portion of target samples to drastically improve adaptation performance. Existing ADA methods mostly rely on the output of a domain discriminator or the original prediction probability to design sample selection strategies and do not fully explore the semantic information of source and target domain features, which may lead to selecting valueless target samples. Moreover, most of them require complex network structures (such as an additional domain discriminator, multiple classifiers, or loss predictors) and multiple query functions. In this work, we propose a concise but effective ADA method called Reconfigurability-Aware Selection for Contrastive active domain adaptation (RASC). With its reconfigurability-aware sample selection strategy, RASC selects the most valuable target samples for annotation in the presence of domain shift. To better utilize the selected target samples, we further design a contrastive learning-based gradual active domain adaptation framework. In addition, we propose a variant of RASC called RASC-Ob, which uses a simpler sample annotation method and supplements the learning of misclassified samples. Extensive experimental results on multiple benchmarks demonstrate the superiority of RASC.
List of keywords
Machine Learning -> ML: Multi-task and transfer learning
Machine Learning -> ML: Semi-supervised learning
2744
Temporal Knowledge Graph Extrapolation via Causal Subhistory Identification
Kai Chen, Ye Wang, Xin Song, Siwei Chen, Han Yu, Aiping Li
6 min. talk | August 8th at 10:00 | Session: KRR: Learning and reasoning
Temporal knowledge graph extrapolation has attracted growing research interest in recent years. Numerous extrapolation methods have been put forth that mine query-relevant information from history to generate forecasts. However, existing approaches normally do not discriminate between causal and non-causal effects in reasoning; instead, they focus on the statistical correlation between the future events to be predicted and the given historical data, which may be deceptive and hinder the model’s capacity to learn the real causal information that actually affects reasoning conclusions. To tackle this, we propose a novel approach called Causal Subhistory Identification (CSI), which extracts the causal subhistory from a large amount of historical data for reasoning purposes. By prioritizing the causal subhistory and eliminating non-causal correlations, CSI improves the clarity and transparency of the reasoning process and more effectively conveys the logic behind conclusions. Extensive experiments demonstrate the remarkable potential of our CSI in terms of superiority, improvement, explainability, and robustness.
List of keywords
Knowledge Representation and Reasoning -> KRR: Learning and reasoning
Data Mining -> DM: Knowledge graphs and knowledge base completion
Natural Language Processing -> NLP: Applications
2752
Navigating Continual Test-time Adaptation with Symbiosis Knowledge
Xu Yang, Moqi Li, Jie Yin, Kun Wei, Cheng Deng
6 min. talk | August 8th at 11:30 | Session: ML: Unsupervised learning
Continual test-time domain adaptation seeks to adapt the source pre-trained model to a continually changing target domain without incurring additional data acquisition or labeling costs. Unfortunately, existing mainstream methods may result in a detrimental cycle: noisy pseudo-labels caused by the domain shift immediately degrade the model’s knowledge, and the long-term accumulation of these negative effects exacerbates the model’s difficulty in generalizing to future domain shifts and contributes to catastrophic forgetting. To address these challenges, this paper introduces a Dual-stream Network that independently optimizes different parameters in each stream to capture symbiotic knowledge from continual domains, thereby ensuring generalization while enhancing instantaneous discrimination. Furthermore, to prevent catastrophic forgetting, a weighted soft parameter alignment method is designed to leverage knowledge from the source model. Finally, efforts are made to calibrate and explore reliable supervision signals to mitigate instantaneous negative optimization. These include label calibration with prior knowledge, label selection using self-adaptive confidence thresholds, and a soft-weighted contrastive module for capturing potential semantics. Extensive experimental results demonstrate that our method achieves state-of-the-art performance on several benchmark datasets.
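The weighted soft parameter alignment term can be pictured as a per-tensor anchor to the source model, as in the sketch below; the uniform default weight is an assumption, since the paper derives its weights rather than fixing them.

```python
import torch

def weighted_alignment_loss(model, source_params, weights):
    """Penalize drift of the adapted model from a frozen source copy,
    with a per-parameter weight controlling the anchoring strength."""
    loss = torch.tensor(0.0)
    for (name, p), p_src in zip(model.named_parameters(), source_params):
        loss = loss + weights.get(name, 1.0) * ((p - p_src) ** 2).sum()
    return loss

net = torch.nn.Linear(4, 2)
source = [p.detach().clone() for p in net.parameters()]
print(weighted_alignment_loss(net, source, {"weight": 2.0}))
```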
List of keywords
Machine Learning -> ML: Unsupervised learning
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
2758
Equilibria in Two-Stage Facility Location with Atomic Clients
Simon Krogmann, Pascal Lenzner, Alexander Skopalik, Marc Uetz, Marnix C. Vos
12 min. talk | August 6th at 15:00 | Session: GTEP: Noncooperative games
We consider competitive facility location as a two-stage multi-agent system with two types of clients. For a given host graph with weighted clients on the vertices, first facility agents strategically select vertices for opening their facilities. Then, the clients strategically select which of the opened facilities in their neighborhood to patronize. Facilities want to attract as much client weight as possible, clients want to minimize congestion on the chosen facility. All recently studied versions of this model assume that clients can split their weight strategically. We consider clients with unsplittable weights, but allow mixed strategies. So clients may randomize over which facility to patronize. Besides modeling a natural client behavior, this subtle change yields drastic changes, e.g., for a given facility placement, qualitatively different client equilibria are possible. As our main result, we show that pure subgame perfect equilibria always exist if all client weights are identical. For this, we use a novel potential function argument, employing a hierarchical classification of the clients and sophisticated rounding in each step. In contrast, for non-identical clients, we show that deciding the existence of even approximately stable states is computationally intractable. On the positive side, we give a tight bound of 2 on the price of anarchy which implies high social welfare of equilibria, if they exist.
List of keywords
Game Theory and Economic Paradigms -> GTEP: Noncooperative games
2759
Angluin-Style Learning of Deterministic Büchi and Co-Büchi Automata
Yong Li, Sven Schewe, Qiyi Tang
12 min. talk | August 8th at 15:00 | Session: ML: Machine Learning (6/6)
While recently developed Angluin-style learning algorithms for omega-automata have much in common with her classic DFA learning algorithm, there is a huge difference in the cost of the equivalence queries about the target automata. For omega-regular languages, the target is to learn nondeterministic Büchi automata (NBAs) through the vehicle of Families of DFAs (FDFAs). While the cost of equivalence queries is usually idealised as constant in learning, it makes a practical difference that language equivalence checking for the learned NBAs is computationally hard. We develop efficient techniques for the cases where we learn deterministic Büchi automata (DBAs) or deterministic co-Büchi automata (DCAs). This is based on the observation that some classes of FDFAs can be used to learn DBAs for DBA-recognisable languages, rather than having to resort to nondeterministic ones. We believe that the restriction to DBAs and DCAs in equivalence queries also makes our algorithm more appealing for realistic applications, as the operations are cheap (in NL) for DBAs and DCAs.
List of keywords
Machine Learning -> ML: Active learning
Knowledge Representation and Reasoning -> KRR: Automated reasoning and theorem proving
Knowledge Representation and Reasoning -> KRR: Learning and reasoning
Machine Learning -> ML: Model-based and model learning reinforcement learning
2765
Compilation and Fast Model Counting beyond CNF
Alexis de Colnet, Stefan Szeider, Tianwei Zhang
6 min. talk | August 7th at 15:00 | Session: KRR: Knowledge Representation and Reasoning (1/2)
Circuits in deterministic decomposable negation normal form (d-DNNF) are representations of Boolean functions that enable linear-time model counting. This paper strengthens our theoretical knowledge of what classes of functions can be efficiently transformed, or compiled, into d-DNNF. Our main contribution is the fixed-parameter tractable (FPT) compilation of conjunctions of specific constraints parameterized by incidence treewidth. This subsumes the known result for CNF. The constraints in question are all functions representable by constant-width ordered binary decision diagrams (OBDDs) for all variable orderings. For instance, this includes parity constraints and cardinality constraints with constant threshold. The running time of the FPT compilation is singly exponential in the incidence treewidth but hides large constants in the exponent. To balance that, we give a more efficient FPT algorithm for model counting that applies to a sub-family of the constraints and does not require compilation.
List of keywords
Knowledge Representation and Reasoning -> KRR: Knowledge compilation
2768
The Orthogonality of Weight Vectors: The Key Characteristics of Normalization and Residual Connections
Zhixing Lu, Yuanyuan Sun, Zhihao Yang, Qin Zhou, Hongfei Lin
6 min. talk | August 8th at 11:30 | Session: ML: Explainable/Interpretable machine learning
Normalization and residual connections find extensive application within the intricate architecture of deep neural networks, contributing significantly to their heightened performance. Nevertheless, the precise factors responsible for this elevated performance have remained elusive. Our theoretical investigations unveil a noteworthy finding: the use of normalization and residual connections enhances the orthogonality of the weight vectors of deep neural networks. This, in turn, induces the Gram matrix of the network weights to exhibit a pronounced tendency towards strict diagonal dominance, thereby amplifying the network’s capacity for feature learning. Meanwhile, we design the parameters independence index (PII) to precisely characterize the orthogonality of parameter vectors. In tandem with our theoretical findings, we undertake empirical validation through experiments on prevalent network models, including fully connected networks (FNNs), convolutional neural networks (CNNs), Transformers, pre-trained language models (PLMs), and large language models (LLMs) composed of Transformers. Finally, we find that a fine-tuning technique (LoRA) preserves the orthogonality of parameter vectors, a finding that carries importance for fine-tuning techniques for LLMs.
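The diagonal-dominance claim is easy to probe numerically: normalize a layer's weight rows, form the Gram matrix, and compare diagonal to off-diagonal mass. The check below is a simple proxy in the spirit of the paper's PII, not its exact definition.

```python
import numpy as np

def gram_diagonal_dominance(W):
    """For a weight matrix W with one weight vector per row, return a
    boolean per row indicating strict diagonal dominance of the Gram
    matrix of the row-normalized weights (a rough orthogonality probe)."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    G = Wn @ Wn.T                      # cosine similarities between rows
    off = np.abs(G).sum(axis=1) - np.abs(np.diag(G))
    return np.diag(G) > off            # True where a row dominates

# High-dimensional random vectors are nearly orthogonal, so most rows pass.
print(gram_diagonal_dominance(np.random.randn(16, 256)))
```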
List of keywords
Machine Learning -> ML: Explainable/Interpretable machine learning
Machine Learning -> ML: Deep learning architectures
2773
Fast and Continual Knowledge Graph Embedding via Incremental LoRA
Jiajun Liu, Wenjun Ke, Peng Wang, Jiahao Wang, Jinhua Gao, Ziyu Shang, Guozheng Li, Zijie Xu, Ke Ji, Yining Li
6 min. talk | August 7th at 15:00 | Session: DM: Data Mining (1/2)
Continual Knowledge Graph Embedding (CKGE) aims to efficiently learn new knowledge and simultaneously preserve old knowledge. Dominant approaches primarily focus on alleviating catastrophic forgetting of old knowledge but neglect efficient learning for the emergence of new knowledge. However, in real-world scenarios, knowledge graphs (KGs) are continuously growing, which brings a significant challenge to fine-tuning KGE models efficiently. To address this issue, we propose a fast CKGE framework (FastKGE), incorporating an incremental low-rank adapter (IncLoRA) mechanism to efficiently acquire new knowledge while preserving old knowledge. Specifically, to mitigate catastrophic forgetting, FastKGE isolates and allocates new knowledge to specific layers based on the fine-grained influence between old and new KGs. Subsequently, to accelerate fine-tuning, FastKGE devises an efficient IncLoRA mechanism, which embeds the specific layers into incremental low-rank adapters with fewer training parameters. Moreover, IncLoRA introduces adaptive rank allocation, which makes the LoRA aware of the importance of entities and adjusts its rank scale adaptively. We conduct experiments on four public datasets and two new datasets with a larger initial scale. Experimental results demonstrate that FastKGE can reduce training time by 34%-49% while still achieving competitive link prediction performance against state-of-the-art models on four public datasets (average MRR score of 21.0% vs. 21.1%). Meanwhile, on two newly constructed datasets, FastKGE saves 51%-68% training time and improves link prediction performance by 1.5%.
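A stripped-down picture of the incremental low-rank adapter idea: the old embedding matrix stays frozen and each growth step learns only a small low-rank delta. The fixed rank and initialization here are assumptions; FastKGE allocates ranks adaptively per layer and entity importance.

```python
import torch
import torch.nn as nn

class IncrementalLoRA(nn.Module):
    """Frozen base embedding plus a trainable low-rank delta, sketching
    the IncLoRA mechanism described in the FastKGE abstract."""

    def __init__(self, base: torch.Tensor, rank: int = 8):
        super().__init__()
        n, d = base.shape
        self.register_buffer("base", base)            # old knowledge, frozen
        self.A = nn.Parameter(torch.randn(n, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, d))   # delta starts at zero

    def forward(self) -> torch.Tensor:
        return self.base + self.A @ self.B

emb = IncrementalLoRA(torch.randn(1000, 128), rank=8)
print(emb().shape)  # torch.Size([1000, 128]); only A and B receive gradients
```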
List of keywords
Data Mining -> DM: Knowledge graphs and knowledge base completion
2778
Focus on the Whole Character: Discriminative Character Modeling for Scene Text Recognition
Bangbang Zhou, Yadong Qu, Zixiao Wang, Zicheng Li, Boqiang Zhang, Hongtao Xie
6 min. talk | August 7th at 15:00 | Session: CV: Recognition (object detection, categorization)
Recently, scene text recognition (STR) models have shown significant performance improvements. However, existing models still encounter difficulties in recognizing challenging texts that involve factors such as severely distorted and perspective characters. These challenging texts mainly cause two problems: (1) large intra-class variance; (2) small inter-class variance. An extremely distorted character may differ prominently in appearance from other characters within the same category, while the variance between characters from different classes is relatively small. To address the above issues, we propose a novel method that enriches character features to enhance the discriminability of characters. First, we propose the Character-Aware Constraint Encoder (CACE), built from multiple stacked blocks. CACE introduces a decay matrix in each block to explicitly guide the attention region for each token. By continuously employing the decay matrix, CACE enables tokens to perceive morphological information at the character level. Second, an Intra-Inter Consistency Loss (I^2CL) is introduced to promote intra-class compactness and inter-class separability in feature space. I^2CL improves the discriminative capability of features by learning a long-term memory unit for each character category. Trained with synthetic data, our model achieves state-of-the-art performance on common benchmarks (94.1% accuracy) and Union14M-Benchmark (61.6% accuracy). Code is available at https://github.com/bang123-box/CFE.
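One way to picture a decay matrix guiding attention: add log(decay^|i-j|) to the attention logits so each token is biased toward nearby positions. This is a simplified stand-in for CACE's per-block decay matrices, with the decay rate an arbitrary assumption.

```python
import torch

def decayed_attention(q, k, v, decay: float = 0.8):
    """Single-head attention whose logits are penalized by distance |i-j|
    through a decay matrix, biasing each token toward its neighborhood."""
    t, d = q.shape[-2], q.shape[-1]
    idx = torch.arange(t)
    log_decay = (idx[:, None] - idx[None, :]).abs().float() \
        * torch.log(torch.tensor(decay))
    scores = q @ k.transpose(-2, -1) / d ** 0.5 + log_decay
    return torch.softmax(scores, dim=-1) @ v

out = decayed_attention(torch.randn(10, 32), torch.randn(10, 32),
                        torch.randn(10, 32))  # -> (10, 32)
```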
List of keywords
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Multimodal learning
2780
Improved Approximation of Weighted MMS Fairness for Indivisible Chores
Fangxiao Wang, Bo Li, Pinyan Lu
6 min. talk | August 8th at 10:00 | Session: GTEP: Fair division
We study how to fairly allocate a set of indivisible chores among n agents who may have different weights corresponding to their involvement in completing these chores. We find that some existing fairness notions may place agents with lower weights at a disadvantage, which motivates us to explore weighted maximin share fairness (WMMS). While it is known that a WMMS allocation may not exist, no non-trivial approximation has been discovered thus far. In this paper, we first design a simple sequential picking algorithm that relies solely on the agents’ ordinal rankings of the items and achieves an approximation ratio of O(log n). Then, for the case involving two agents, we improve the approximation ratio to (√3+1)/2 ≈ 1.366 and prove that it is optimal. We also consider an online setting in which the items arrive one after another, and design an O(√n)-competitive online algorithm given that the valuations are normalized.
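A sketch of a purely ordinal sequential picking protocol for chores follows: agents take plain round-robin turns, and on each turn an agent takes its least-burdensome remaining chore. The paper's algorithm additionally accounts for agent weights when ordering turns, so treat this only as an illustration of ordinal picking.

```python
def sequential_picking(rankings, num_agents):
    """rankings[a] lists the chores in agent a's order from least to most
    burdensome; agents pick round-robin until all chores are assigned."""
    remaining = set(rankings[0])              # all chores appear in each list
    bundles = [[] for _ in range(num_agents)]
    turn = 0
    while remaining:
        agent = turn % num_agents
        # Take the chore this agent ranks as least bad among those left.
        pick = min(remaining, key=lambda c: rankings[agent].index(c))
        bundles[agent].append(pick)
        remaining.remove(pick)
        turn += 1
    return bundles

print(sequential_picking([[0, 1, 2, 3], [3, 2, 1, 0]], 2))
```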
List of keywords
Game Theory and Economic Paradigms -> GTEP: Fair division
2782
Online Learning with Off-Policy Feedback in Adversarial MDPs
Francesco Bacchiocchi, Francesco Emanuele Stradi, Matteo Papini, Alberto Maria Metelli, Nicola Gatti
12 min. talk | August 7th at 15:00 | Session: ML: Machine Learning (4/6)
In this paper, we face the challenge of online learning in adversarial Markov decision processes with off-policy feedback. In this setting, the learner chooses a policy, but, differently from the traditional on-policy setting, the environment is explored by means of a different, fixed, and possibly unknown policy (named colleague’s policy). The off-policy feedback presents an additional issue that is not present in traditional settings: the learner is charged with the regret of its chosen policy but it observes only the rewards gained by the colleague’s policy. First, we present a lower-bound for the setting we propose, which shows that the optimal dependency of the sublinear regret is w.r.t. the dissimilarity between the optimal policy in hindsight and the colleague’s policy. Then, we propose novel algorithms that, by employing pessimistic estimators—commonly adopted in the off-line reinforcement learning literature—ensure sublinear regret bounds depending on the desired dissimilarity, even when the colleague’s policy is unknown.
List of keywords
Machine Learning -> ML: Online learning
Machine Learning -> ML: Reinforcement learning
2785
Optimizing Viscous Democracy
Ben Armstrong, Shiri Alouf-Heffetz, Nimrod Talmon
6 min. talk | August 7th at 15:00 | Session: GTEP: Computational social choice (2/2)
Viscous democracy is a generalization of liquid democracy, a social choice framework in which voters may transitively delegate their votes. In viscous democracy, a "viscosity" factor decreases the weight of a delegation the further it travels, reducing the chance of excessive weight flowing between ideologically misaligned voters. We demonstrate that viscous democracy often significantly improves the quality of group decision-making over liquid democracy. We first show that finding optimal delegations within a viscous setting is NP-hard. However, simulations allow us to explore the practical effects of viscosity. Across social network structures, competence distributions, and delegation mechanisms we find high viscosity reduces the chance of “super-voters” attaining large amounts of weight and increases the number of voters that are able to affect the outcome of elections. This, in turn, improves group accuracy as a whole. As a result, we argue that viscosity should be considered a core component of liquid democracy.
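The viscosity mechanism itself is tiny: a delegated vote loses a factor alpha per hop it travels. The helper below computes a casting voter's effective weight under that rule; the delegation encoding and alpha value are illustrative assumptions.

```python
def effective_weight(delegations, voter, alpha=0.5):
    """delegations maps each voter to the voter it delegates to, or None
    for voters who cast their ballot directly. A vote is discounted by
    alpha for every hop along its delegation chain."""
    total = 0.0
    for v in delegations:
        w, cur = 1.0, v
        while delegations[cur] is not None:   # follow the chain to its end
            cur = delegations[cur]
            w *= alpha                        # viscosity: lose weight per hop
        if cur == voter:
            total += w
    return total

# Chain 0 -> 1 -> 2 (casts); voter 3 casts directly.
print(effective_weight({0: 1, 1: 2, 2: None, 3: None}, voter=2))  # 1.75
```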
List of keywords
Game Theory and Economic Paradigms -> GTEP: Computational social choice
Agent-based and Multi-agent Systems -> MAS: Agent-based simulation and emergence
Multidisciplinary Topics and Applications -> MTA: Web and social networks
2802
Reschedule Diffusion-based Bokeh Rendering
Shiyue Yan, Xiaoshi Qiu, Qingmin Liao, Jing-Hao Xue, Shaojun Liu
6 min. talk | August 8th at 15:00 | Session: CV: Image and video synthesis and generation (2/2)
Bokeh rendering for images shot with small apertures has drawn much attention in practice. Very recently, researchers have started to explore diffusion models for bokeh rendering, aiming to leverage the models’ surging power of image generation. However, two big issues can clearly be observed in the images rendered by diffusion models: large fluctuation and severe color deviation. To address these issues, we propose in this paper a prior-aware sampling approach, which can adaptively control the noise scale through learned priors, and a prior-aware noise scheduling strategy, which can greatly reduce the number of inference steps without sacrificing performance. Extensive experiments show that our method effectively alleviates the fluctuation problem of sampling results while ensuring color styles similar to those of the input image. In addition, our method outperforms state-of-the-art methods, sometimes even with only two steps of sampling. Our code is available at https://github.com/Loeiii/Reschedule-Diffusion-based-Bokeh-Rendering.
List of keywords
Computer Vision -> CV: Image and video synthesis and generation 
Humans and AI -> HAI: Applications
Machine Learning -> ML: Applications
2809
Cross-Scale Domain Adaptation with Comprehensive Information for Pansharpening
Meiqi Gong, Hao Zhang, Hebaixu Wang, Jun Chen, Jun Huang, Xin Tian, Jiayi Ma
6 min. talk | August 7th at 10:00 | Session: CV: Multimodal learning
Deep learning-based pansharpening methods typically use simulated data at the reduced-resolution scale for training. This limits their performance when the trained model is generalized to the full-resolution scale, owing to incomplete utilization of the information in panchromatic (PAN) images at the full-resolution scale and low generalization ability. In this paper, we adopt two targeted strategies to address these two problems. On the one hand, we introduce a cross-scale comprehensive information capture module, which improves the information utilization of the original PAN image through fully-supervised reconstruction. On the other hand, we pioneer a domain adaptation strategy to tackle the problem of low generalization across different scales. Considering the intrinsic domain gap between different scales, we leverage the maximum mean discrepancy loss and the inherent pixel-level correlations between features at different scales to reduce the scale variance, thus boosting the generalization ability of our model. Experiments on various satellites demonstrate the superiority of our method over the state of the art in terms of information retention. Our code is publicly available at https://github.com/Meiqi-Gong/SDIPS.
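For reference, the maximum mean discrepancy term mentioned above can be computed with a plain RBF kernel as sketched below; the fixed bandwidth is an assumption rather than the paper's choice.

```python
import torch

def mmd_rbf(x, y, sigma: float = 1.0) -> torch.Tensor:
    """Biased RBF-kernel MMD between two feature batches: small when the
    two batches come from similar distributions, large under a domain gap."""
    def k(a, b):
        d = torch.cdist(a, b) ** 2
        return torch.exp(-d / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Features from two "scales"; the offset creates a visible discrepancy.
print(mmd_rbf(torch.randn(32, 64), torch.randn(32, 64) + 1.0))
```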
List of keywords
Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Computational photography
2821
Learning Pareto Set for Multi-Objective Continuous Robot Control
Tianye Shu, Ke Shang, Cheng Gong, Yang Nan, Hisao Ishibuchi
12 min. talk | August 7th at 15:00 | Session: ML: Reinforcement learning (1/2)
For a control problem with multiple conflicting objectives, there exists a set of Pareto-optimal policies called the Pareto set instead of a single optimal policy. When a multi-objective control problem is continuous and complex, traditional multi-objective reinforcement learning (MORL) algorithms search for many Pareto-optimal deep policies to approximate the Pareto set, which is quite resource-consuming. In this paper, we propose a simple and resource-efficient MORL algorithm that learns a continuous representation of the Pareto set in a high-dimensional policy parameter space using a single hypernet. The learned hypernet can directly generate various well-trained policy networks for different user preferences. We compare our method with two state-of-the-art MORL algorithms on seven multi-objective continuous robot control problems. Experimental results show that our method achieves the best overall performance with the least training parameters. An interesting observation is that the Pareto set is well approximated by a curved line or surface in a high-dimensional parameter space. This observation will provide insight for researchers to design new MORL algorithms.
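The core of the method, a single hypernet mapping a preference vector to an entire policy's parameters, can be sketched as follows; all sizes and the two-layer architecture are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PreferenceHypernet(nn.Module):
    """Map an n_obj-dimensional preference vector to the flattened
    parameters of a policy network, giving a continuous parameterization
    of the (approximate) Pareto set."""

    def __init__(self, n_obj: int, policy_params: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_obj, 128), nn.ReLU(),
                                 nn.Linear(128, policy_params))

    def forward(self, pref: torch.Tensor) -> torch.Tensor:
        return self.net(pref)

# Parameters for a toy 4->64 linear policy (weights plus biases).
hyper = PreferenceHypernet(n_obj=2, policy_params=4 * 64 + 64)
theta = hyper(torch.tensor([0.3, 0.7]))  # one point on the trade-off curve
```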
List of keywords
Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Optimization
Robotics -> ROB: Learning in robotics
2823
Towards Geometric Normalization Techniques in SE(3) Equivariant Graph Neural Networks for Physical Dynamics Simulations
Ziqiao Meng, Liang Zeng, Zixing Song, Tingyang Xu, Peilin Zhao, Irwin King
12 min. talk | August 7th at 11:30 | Session: MTA: Physical sciences
SE(3) equivariance is a fundamental property that is highly desirable in physical dynamics modeling. This property ensures that neural outputs remain robust when the inputs are translated or rotated. Recently, there have been several proposals for SE(3) equivariant graph neural networks (GNNs) that have shown promising results in simulating particle dynamics. However, existing works have neglected an important issue: current SE(3) equivariant GNNs cannot scale to large particle systems. Although some simple normalization techniques are already in use to stabilize the training dynamics of equivariant graph networks, they actually break the SE(3) equivariance of the architectures. In this work, we first show the numerical instability of training equivariant GNNs on large particle systems and then analyze some existing normalization strategies adopted in modern works. We propose a new normalization layer called GeoNorm, which satisfies SE(3) equivariance while stabilizing the training process. We conduct comprehensive experiments on N-body system simulation tasks with larger particle system sizes. The experimental results demonstrate that GeoNorm successfully preserves SE(3) equivariance compared to baseline techniques and stabilizes the training dynamics of SE(3) equivariant GNNs on large systems.
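Why can a normalization be equivariant at all? If vector features are rescaled only by rotation-invariant statistics (lengths), then rotating the input rotates the output identically. The sketch below illustrates that principle; it is not the GeoNorm layer itself.

```python
import torch

def equivariant_norm_sketch(v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Rescale vector features by the mean vector norm, a rotation-invariant
    scalar, so the operation commutes with rotations of the input.
    v: (n_nodes, 3) relative vectors (translation already removed)."""
    scale = v.norm(dim=-1).mean().clamp_min(eps)
    return v / scale

v = torch.randn(100, 3)
print(equivariant_norm_sketch(v).norm(dim=-1).mean())  # ~1.0 after scaling
```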
List of keywords
Multidisciplinary Topics and Applications -> MTA: Physical sciences
Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Geometric learning
Machine Learning -> ML: Sequence and graph learning
2836
FedGCS: A Generative Framework for Efficient Client Selection in Federated Learning via Gradient-based Optimization
Zhiyuan Ning, Chunlin Tian, Meng Xiao, Wei Fan, Pengyang Wang, Li Li, Pengfei Wang, Yuanchun Zhou
6 min. talk | August 6th at 15:00 | Session: ML: Federated learning (1/2)
Federated Learning faces significant challenges in statistical and system heterogeneity, along with high energy consumption, necessitating efficient client selection strategies. Traditional approaches, including heuristic and learning-based methods, fall short of addressing these complexities holistically. In response, we propose FedGCS, a novel generative client selection framework that innovatively recasts the client selection process as a generative task. Drawing inspiration from the methodologies used in large language models, FedGCS efficiently encodes abundant decision-making knowledge within a continuous representation space, enabling efficient gradient-based optimization to search for optimal client selection that will be finally output via generation. The framework comprises four steps: (1) automatic collection of diverse “selection-score” pair data using classical client selection methods; (2) training an encoder-evaluator-decoder framework on this data to construct a continuous representation space; (3) employing gradient-based optimization in this space for optimal client selection; (4) generating the final optimal client selection via using beam search for the well-trained decoder. FedGCS outperforms traditional methods by being more comprehensive, generalizable, and efficient, simultaneously optimizing for model performance, latency, and energy consumption. The effectiveness of FedGCS is proven through extensive experimental analyses.
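Step (3), gradient-based search in the learned representation space, reduces to ascending a differentiable evaluator, as sketched below with a toy scorer standing in for the trained evaluator network; the optimizer, step count, and learning rate are assumptions.

```python
import torch

def latent_search(evaluator, z0, steps: int = 100, lr: float = 0.05):
    """Ascend a differentiable evaluator over a continuous representation;
    a decoder (not shown) would turn the resulting embedding into a
    concrete client selection."""
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-evaluator(z)).backward()   # maximize the predicted score
        opt.step()
    return z.detach()

# Toy evaluator whose optimum sits at the all-ones embedding.
z = latent_search(lambda z: -(z - 1.0).pow(2).sum(), torch.zeros(16))
print(z.round())
```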
List of keywords
Machine Learning -> ML: Federated learning
Data Mining -> DM: Applications
Data Mining -> DM: Other
Machine Learning -> ML: Applications
2851
Extremal Separation Problems for Temporal Instance Queries
Jean Christoph Jung, Vladislav Ryzhikov, Frank Wolter, Michael Zakharyaschev
6 min. talk | August 7th at 15:00 | Session: KRR: Knowledge Representation and Reasoning (1/2)
The separation problem for a class Q of database queries is to find a query in Q that distinguishes between a given set of ‘positive’ and ‘negative’ data examples. Separation provides explanations of examples and underpins the query-by-example paradigm to support database users in constructing and refining queries. As the space of all separating queries can be large, it is helpful to succinctly represent this space by means of its most specific (logically strongest) and general (weakest) members. We investigate this extremal separation problem for classes of instance queries formulated in linear temporal logic LTL with the operators conjunction, ‘next’, and ‘eventually’. Our results range from tight complexity bounds for verifying and counting extremal separators to algorithms computing them.
List of keywords
Knowledge Representation and Reasoning -> KRR: Qualitative, geometric, spatial, and temporal reasoning
Knowledge Representation and Reasoning -> KRR: Learning and reasoning
Multidisciplinary Topics and Applications -> MTA: Databases
2852
Boosting Single Positive Multi-label Classification with Generalized Robust Loss
Yanxi Chen, Chunxiao Li, Xinyang Dai, Jinhuan Li, Weiyu Sun, Yiming Wang, Renyuan Zhang, Tinghe Zhang, Bo Wang
6 min. talk | August 9th at 11:30 | Session: ML: Multi-label learning
Multi-label learning (MLL) requires comprehensive multi-semantic annotations that are hard to fully obtain, often resulting in missing-label scenarios. In this paper, we investigate Single Positive Multi-label Learning (SPML), where each image is associated with merely one positive label. Existing SPML methods focus only on designing losses using mechanisms such as hard pseudo-labeling and robust losses, mostly leading to unacceptable false negatives. To address this issue, we first propose a generalized loss framework based on expected risk minimization that provides soft pseudo labels, and point out that existing losses can be seamlessly converted into our framework. In particular, we design a novel robust loss based on our framework, which enjoys flexible coordination between false positives and false negatives and can additionally deal with the imbalance between positive and negative samples. Extensive experiments show that our approach significantly improves SPML performance and outperforms the vast majority of state-of-the-art methods on all four benchmarks. Our code is available at https://github.com/yan4xi1/GRLoss.
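The soft pseudo-labeling idea can be illustrated with a tiny stand-alone loss; the soft target value, the class weights, and the function name below are illustrative assumptions, not the paper's exact GRLoss.

import numpy as np

def single_positive_soft_loss(logits, pos_idx, soft_neg=0.05, w_pos=1.0, w_neg=1.0):
    # logits: (n_classes,) scores for one image; pos_idx: the single
    # observed positive label. Unobserved labels get a soft target
    # `soft_neg` instead of a hard 0, tempering false negatives, and
    # w_pos/w_neg rebalance positives against the many assumed negatives.
    p = 1.0 / (1.0 + np.exp(-logits))
    targets = np.full_like(p, soft_neg)
    targets[pos_idx] = 1.0
    bce = -(targets * np.log(p + 1e-12) + (1 - targets) * np.log(1 - p + 1e-12))
    weights = np.full_like(p, w_neg)
    weights[pos_idx] = w_pos
    return float((weights * bce).mean())

print(single_positive_soft_loss(np.array([2.0, -1.0, 0.5]), pos_idx=0))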
List of keywords
Machine Learning -> ML: Multi-label learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Weakly supervised learning
2859
SketchEdit: Editing Freehand Sketches at the Stroke-Level
Tengjie Li, Shikui Tu, Lei Xu
6 min. talk | August 9th at 11:30 | Session: ML: Generative models
Recent sketch synthesis methods can generate lifelike results. However, these methods encode entire sketches directly, making it challenging to decouple strokes from sketches, and they struggle to control local sketch synthesis, e.g., stroke editing. Moreover, sketch editing must position the edited strokes accurately, because users may not draw at the exact position and the same stroke may appear in various locations in different sketches. We propose SketchEdit to realize, for the first time, flexible editing of sketches at the stroke level. To tackle the challenge of decoupling strokes, SketchEdit divides the drawing sequence of a sketch into a series of strokes based on the pen state, aligns the stroke segments to share the same starting position, and learns an embedding for every stroke with a proposed stroke encoder. Moreover, we overcome the stroke placement problem via a diffusion process, which progressively generates the locations of the strokes to be synthesized, using the stroke features as the guiding condition. Experiments demonstrate that SketchEdit is effective for stroke-level sketch editing and sketch reconstruction. The source code is publicly available at https://github.com/CMACH508/SketchEdit/.
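The stroke decoupling and alignment step can be sketched in a few lines, assuming the common 'stroke-5' drawing encoding (dx, dy, pen-down, pen-up, end); the encoding choice and helper name are assumptions for illustration.

import numpy as np

def split_and_align_strokes(seq):
    # Split a drawing sequence into strokes at pen-up states, then
    # translate each stroke so it starts at the origin (the alignment
    # step described in the abstract).
    strokes, current = [], []
    for step in seq:
        current.append(step[:2])
        if step[3] == 1:                      # pen lifted: stroke ends
            pts = np.cumsum(np.array(current, dtype=float), axis=0)
            strokes.append(pts - pts[0])      # same starting position
            current = []
    return strokes

seq = [(1, 0, 1, 0, 0), (0, 1, 0, 1, 0), (2, 2, 1, 0, 0), (1, -1, 0, 1, 0)]
for s in split_and_align_strokes(seq):
    print(s)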
List of keywords
Machine Learning -> ML: Generative models
Computer Vision -> CV: Applications
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Representation learning
2864
Large Language Model Guided Knowledge Distillation for Time Series Anomaly Detection
Chen Liu, Shibo He, Qihang Zhou, Shizhong Li, Wenchao Meng
6 min. talk | August 8th at 11:30 | Session: DM: Anomaly/outlier detection
Self-supervised methods have gained prominence in time series anomaly detection due to the scarcity of available annotations. Nevertheless, they typically demand extensive training data to acquire a generalizable representation map, which conflicts with scenarios where only a few samples are available, thereby limiting their performance. To overcome this limitation, we propose AnomalyLLM, a knowledge distillation-based time series anomaly detection approach in which the student network is trained to mimic the features of the large language model (LLM)-based teacher network that is pretrained on large-scale datasets. During the testing phase, anomalies are detected when the discrepancy between the features of the teacher and student networks is large. To prevent the student network from learning the teacher network’s features on anomalous samples, we devise two key strategies. 1) Prototypical signals are incorporated into the student network to consolidate the normal feature extraction. 2) We use synthetic anomalies to enlarge the representation gap between the two networks. AnomalyLLM demonstrates state-of-the-art performance on 15 datasets, improving accuracy by at least 14.5% on the UCR dataset.
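The detection rule itself is simple to state: score each time step by the teacher-student feature gap. A plain L2 gap is used in the sketch below for illustration; the paper's exact discrepancy measure may differ.

import numpy as np

def anomaly_scores(teacher_feats, student_feats):
    # feats: (T, d) per-step features. The student mimics the teacher on
    # normal data, so a large gap flags an anomaly.
    return np.linalg.norm(teacher_feats - student_feats, axis=-1)

rng = np.random.default_rng(1)
teacher = rng.normal(size=(100, 16))
student = teacher + 0.01 * rng.normal(size=(100, 16))  # good mimicry
student[40] += 2.0                                     # mimicry fails here
print(int(anomaly_scores(teacher, student).argmax()))  # -> 40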
List of keywords
Data Mining -> DM: Anomaly/outlier detection
Data Mining -> DM: Mining spatial and/or temporal data
Machine Learning -> ML: Unsupervised learning
2865
The Transformation Logics
Alessandro Ronca
6 min. talk | August 7th at 15:00 | Session: KRR: Knowledge Representation and Reasoning (1/2)
We introduce a new family of temporal logics designed to finely balance the trade-off between expressivity and complexity. Their key feature is the possibility of defining operators of a new kind that we call transformation operators. Some of them subsume existing temporal operators, while others are entirely novel. Of particular interest are transformation operators based on semigroups. They enable logics to harness the richness of semigroup theory, and we show them to yield logics capable of creating hierarchies of increasing expressivity and complexity which are non-trivial to characterise in existing logics. The result is a genuinely novel and yet unexplored landscape of temporal logics, each with the potential to match the trade-off between expressivity and complexity required by specific applications.
List of keywords
Knowledge Representation and Reasoning -> KRR: Knowledge representation languages
Knowledge Representation and Reasoning -> KRR: Computational complexity of reasoning
Knowledge Representation and Reasoning -> KRR: Qualitative, geometric, spatial, and temporal reasoning
2875
First-Order Progression beyond Local-Effect and Normal Actions
Daxin Liu, Jens Claßen
6 min. talk | August 9th at 11:30 | Session: KRR: Reasoning about actions
One of the fundamental problems in reasoning about action is progression, which is to update a knowledge base according to the effects of an action into another knowledge base that retains all proper information. The problem is notoriously challenging, as in general it requires second-order logic. Efforts have been made to find fragments where progression is first-order definable. Liu and Lakemeyer showed that for actions that have only local effects, progression is always first-order definable. They also generalized the result to so-called normal actions, which allow for non-local effects as long as the affected fluent predicates only depend on local-effect ones, under certain restrictions on the knowledge base. In addition, they showed that for so-called proper+ knowledge bases, progression for normal actions can be efficient under reasonable assumptions. In this paper, we consider a larger class of theories, called the acyclic ones, that strictly subsumes normal actions. In such theories, dependencies between non-local-effect fluent predicates are allowed, as long as they do not contain any cycles. We prove progression to be equally first-order definable for this class. Furthermore, under similar but stronger assumptions than those made by Liu and Lakemeyer, we show that progression is efficient as well.
List of keywords
Knowledge Representation and Reasoning -> KRR: Reasoning about actions
2878
BATON: Aligning Text-to-Audio Model Using Human Preference Feedback
Huan Liao, Haonan Han, Kai Yang, Tianjiao Du, Rui Yang, Qinmei Xu, Zunnan Xu, Jingquan Liu, Jiasheng Lu, Xiu Li
6 min. talk | August 8th at 10:00 | Session: NLP: Speech
With the development of AI-Generated Content (AIGC), text-to-audio models are gaining widespread attention. However, it is challenging for these models to generate audio aligned with human preference due to the inherent information density of natural language and the limited understanding ability of models. To alleviate this issue, we propose BATON, the first framework specifically designed to enhance the alignment between generated audio and text prompts using human preference feedback. BATON comprises three key stages: Firstly, we curated a dataset containing both prompts and the corresponding generated audio, which was then annotated based on human feedback. Secondly, we introduced a reward model trained on the constructed dataset, which can mimic human preference by assigning rewards to input text-audio pairs. Finally, we employed the reward model to fine-tune an off-the-shelf text-to-audio model. The experimental results demonstrate that BATON can significantly improve the generation quality of the original text-to-audio models with respect to audio integrity, temporal relationships, and alignment with human preference. The project page is available at https://baton2024.github.io.
List of keywords
Machine Learning -> ML: Generative models
Multidisciplinary Topics and Applications -> MTA: Arts and creativity
Natural Language Processing -> NLP: Speech
2879
Truthful Interval Covering
Argyrios Deligkas, Aris Filos-Ratsikas, Alexandros A. Voudouris
6 min. talk | August 8th at 11:30 | Session: GTEP: Mechanism design
We initiate the study of a novel problem in mechanism design without money, which we term Truthful Interval Covering (TIC). An instance of TIC consists of a set of agents each associated with an individual interval on a line, and the objective is to decide where to place a covering interval to minimize the total social or egalitarian cost of the agents, which is determined by the intersection of this interval with their individual ones. This fundamental problem can model situations of provisioning a public good, such as the use of power generators to prevent or mitigate load shedding in developing countries. In the strategic version of the problem, the agents wish to minimize their individual costs, and might misreport the position and/or length of their intervals to achieve that. Our goal is to design truthful mechanisms to prevent such strategic misreports and achieve good approximations to the best possible social or egalitarian cost. We consider the fundamental setting of known intervals with equal lengths and provide tight bounds on the approximation ratios achieved by truthful deterministic mechanisms. For the social cost, we also design a randomized truthful mechanism that outperforms all possible deterministic ones. Finally, we highlight a plethora of natural extensions of our model for future work, as well as some natural limitations of those settings.
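One natural instantiation of the cost model (an assumption: the abstract only says cost is determined by the intersection) is the uncovered length of each agent's interval, which makes the objectives easy to state:

def agent_cost(agent, cover):
    # Intervals are (left, right) pairs; cost = part of the agent's
    # interval left uncovered by the covering interval (illustrative).
    (a, b), (c, d) = agent, cover
    overlap = max(0.0, min(b, d) - max(a, c))
    return (b - a) - overlap

def social_cost(agents, cover):
    return sum(agent_cost(ag, cover) for ag in agents)

agents = [(0.0, 1.0), (0.5, 1.5), (3.0, 4.0)]
for pos in (0.0, 0.5, 3.0):   # slide a unit-length covering interval
    print(pos, social_cost(agents, (pos, pos + 1.0)))

Under such a model, an agent could gain by misreporting its interval to pull the covering interval towards itself, which is exactly what truthful mechanisms must rule out.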
List of keywords
Game Theory and Economic Paradigms -> GTEP: Mechanism design
Game Theory and Economic Paradigms -> GTEP: Computational social choice
2885
Revisiting Causal Discovery from a Complexity-Theoretic Perspective
Robert Ganian, Viktoriia Korchemna, Stefan Szeider
6 min. talk | August 7th at 15:00 | Session: KRR: Knowledge Representation and Reasoning (1/2)
Causal discovery seeks to unveil causal relationships (represented as a so-called causal graph) from observational data. This paper investigates the complex relationship between the graph structure and the efficiency of constraint-based causal discovery algorithms. Our main contributions include (i) a near-tight characterization of which causal graphs admit a small d-separating set for each pair of vertices and thus can potentially be efficiently recovered by a constraint-based causal discovery algorithm, (ii) the explicit construction of a sequence of causal graphs on which the influential PC algorithm might need exponential time, although there is a small d-separating set between every pair of variables, and (iii) the formulation of a new causal discovery algorithm which achieves fixed-parameter running time by considering the maximum number of edge-disjoint paths between variables in the (undirected) super-structure as the parameter. A distinguishing feature of our investigation is that it is carried out within a more fine-grained model which more faithfully captures the infeasibility of performing accurate independence tests for large sets of conditioning variables.
List of keywords
Knowledge Representation and Reasoning -> KRR: Causality
Knowledge Representation and Reasoning -> KRR: Computational complexity of reasoning
2901
Vision-fused Attack: Advancing Aggressive and Stealthy Adversarial Text against Neural Machine Translation
Yanni Xue, Haojie Hao, Jiakai Wang, Qiang Sheng, Renshuai Tao, Yu Liang, Pu Feng, Xianglong Liu
6 min. talk | August 9th at 11:30 | Session: NLP: Natural Language Processing (3/3)
While neural machine translation (NMT) models achieve success in our daily lives, they show vulnerability to adversarial attacks. Despite being harmful, these attacks also offer benefits for interpreting and enhancing NMT models, thus drawing increased research attention. However, existing studies on adversarial attacks are insufficient in both attacking ability and human imperceptibility because they focus solely on the language modality. This paper proposes a novel vision-fused attack (VFA) framework to acquire powerful adversarial text, i.e., text that is more aggressive and stealthy. Regarding attacking ability, we design the vision-merged solution space enhancement strategy to enlarge the limited semantic solution space, which enables us to search for adversarial candidates with higher attacking ability. For human imperceptibility, we propose the perception-retained adversarial text selection strategy to align with the human text-reading mechanism. Thus, the finally selected adversarial text is more deceptive. Extensive experiments on various models, including large language models (LLMs) like LLaMA and GPT-3.5, strongly support that VFA outperforms comparison methods by large margins (up to 81%/14% improvements on ASR/SSIM).
List of keywords
Natural Language Processing -> NLP: Machine translation and multilinguality
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
Machine Learning -> ML: Trustworthy machine learning
2906
Enhanced DouDiZhu Card Game Strategy Using Oracle Guiding and Adaptive Deep Monte Carlo Method
Qian Luo, Tien Ping Tan, Daochen Zha, Tianqiao Zhang
6 min. talk | August 9th at 10:00 | Session: MTA: Multidisciplinary Topics and Applications (2/2)
Deep Reinforcement Learning (DRL) has achieved significant advances in games with both perfect and imperfect information, such as Go, Chess, Texas Hold’em, and Dota2. However, DRL encounters considerable challenges when tackling the card game DouDiZhu because of its imperfect information, large state-action space, and sparse rewards. This paper presents OADMCDou, which combines Oracle Guiding and an Adaptive Deep Monte Carlo method to address these challenges. Oracle Guiding trains an Oracle agent with both imperfect and perfect information, gradually reducing the reliance on perfect information to transition to a standard agent. Adaptive Deep Monte Carlo uses gradient weight clipping and constrains the magnitude of updates to prevent extreme policy updates. We conduct extensive experiments to evaluate the effectiveness of the proposed methods, demonstrating OADMCDou’s superior performance over the state-of-the-art DouDiZhu AI, DouZero. This superiority is reflected in two metrics: a 95% confidence interval of 0.104 ± 0.041 for performance, and a 28.6% reduction in loss.
List of keywords
Multidisciplinary Topics and Applications -> MTA: Entertainment
Multidisciplinary Topics and Applications -> MTA: Computer games
Multidisciplinary Topics and Applications -> MTA: Game playing
Agent-based and Multi-agent Systems -> MAS: Other
2920
Zero-shot High-fidelity and Pose-controllable Character Animation
Bingwen Zhu, Fanyi Wang, Tianyi Lu, Peng Liu, Jingwen Su, Jinxiu Liu, Yanhao Zhang, Zuxuan Wu, Guo-Jun Qi, Yu-Gang Jiang
6 min. talk | August 8th at 11:30 | Session: CV: Image and video synthesis and generation (1/2)
Image-to-video (I2V) generation aims to create a video sequence from a single image, which requires high temporal coherence and visual fidelity. However, existing approaches suffer from inconsistency of character appearances and poor preservation of fine details. Moreover, they require a large amount of video data for training, which can be computationally demanding. To address these limitations, we propose PoseAnimate, a novel zero-shot I2V framework for character animation. PoseAnimate contains three key components: 1) a Pose-Aware Control Module (PACM) that incorporates diverse pose signals into text embeddings, to preserve character-independent content and maintain precise alignment of actions. 2) a Dual Consistency Attention Module (DCAM) that enhances temporal consistency and retains character identity and intricate background details. 3) a Mask-Guided Decoupling Module (MGDM) that refines distinct feature perception abilities, improving animation fidelity by decoupling the character and background. We also propose a Pose Alignment Transition Algorithm (PATA) to ensure smooth action transitions. Extensive experimental results demonstrate that our approach outperforms the state-of-the-art training-based methods in terms of character consistency and detail fidelity. Moreover, it maintains a high level of temporal coherence throughout the generated animations.
List of keywords
Computer Vision -> CV: Image and video synthesis and generation 
Machine Learning -> ML: Multi-modal learning
2924
Evaluation of Project Performance in Participatory Budgeting
Niclas Boehmer, Piotr Faliszewski, Łukasz Janeczko, Dominik Peters, Grzegorz Pierczyński, Šimon Schierreich, Piotr Skowron, Stanisław Szufa
6 min. talk | August 6th at 11:30 | Session: GTEP: Computational social choice (1/2)
We study ways of evaluating the performance of losing projects in participatory budgeting (PB) elections by seeking actions that would make them win. We focus on lowering their costs, obtaining additional approvals, and removing approvals for competing projects: the larger the change needed, the less successful the given project. We seek efficient algorithms for computing our measures and analyze them experimentally, focusing on the GreedyAV, Phragmén, and Equal-Shares PB rules.
List of keywords
Game Theory and Economic Paradigms -> GTEP: Computational social choice
2925
Geometry-Guided Conditional Adaptation for Surrogate Models of Large-Scale 3D PDEs on Arbitrary Geometries
Jingyang Deng, Xingjian Li, Haoyi Xiong, Xiaoguang Hu, Jinwen Ma
6 min. talk | August 7th at 11:30 | Session: MTA: Physical sciences
Deep learning surrogate models aim to accelerate the solving of partial differential equations (PDEs) and have achieved promising results. Although several mainstream models based on neural operator learning have been applied to PDEs on varying geometries, they were designed to map the complex geometry to a latent uniform grid, which remains challenging for networks with general architectures to learn. In this work, we rethink the critical factors of PDE solutions and propose a novel model-agnostic framework, called 3D Geometry-Guided Conditional Adaptation (3D-GeoCA), for solving PDEs on arbitrary 3D geometries. Starting with a 3D point cloud geometry encoder, 3D-GeoCA extracts essential and robust representations of geometric shapes of any kind, which conditionally guide the adaptation of hidden features in the surrogate model. We conduct experiments on two public computational fluid dynamics datasets, the Shape-Net Car and Ahmed-Body datasets, using several surrogate models as backbones with various point cloud geometry encoders to simulate the corresponding large-scale Reynolds-Averaged Navier-Stokes equations. Equipped with 3D-GeoCA, these backbone models reduce the L2 error by a large margin. Moreover, 3D-GeoCA is model-agnostic, so it can be applied to any surrogate model.
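The conditional guidance of hidden features suggests a modulation of the surrogate's hidden states by the geometry code. The sketch below uses FiLM-style scale-and-shift conditioning as a plausible stand-in; the weight matrices and the exact adaptation rule are assumptions.

import numpy as np

def geometry_conditioned_adapt(h, g, W_gamma, W_beta):
    # h: (n_points, d) hidden features of the surrogate model.
    # g: (k,) geometry embedding from a point cloud encoder.
    gamma = g @ W_gamma              # (d,) per-channel scale
    beta = g @ W_beta                # (d,) per-channel shift
    return h * (1.0 + gamma) + beta  # geometry-conditioned features

rng = np.random.default_rng(2)
h, g = rng.normal(size=(128, 64)), rng.normal(size=(32,))
out = geometry_conditioned_adapt(h, g,
                                 0.1 * rng.normal(size=(32, 64)),
                                 0.1 * rng.normal(size=(32, 64)))
print(out.shape)  # (128, 64)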
List of keywords
Multidisciplinary Topics and Applications -> MTA: Physical sciences
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
Machine Learning -> ML: Supervised Learning
2929
MARCO: A Memory-Augmented Reinforcement Framework for Combinatorial Optimization
Andoni I. Garmendia, Quentin Cappart, Josu Ceberio, Alexander Mendiburu
12 min. talk | August 7th at 11:30 | Session: S: Combinatorial search and optimisation
Neural Combinatorial Optimization (NCO) is an emerging domain where deep learning techniques are employed to address combinatorial optimization problems as a standalone solver. Despite their potential, existing NCO methods often suffer from inefficient search space exploration, frequently leading to local optima entrapment or redundant exploration of previously visited states. This paper introduces a versatile framework, referred to as Memory-Augmented Reinforcement for Combinatorial Optimization (MARCO), that can be used to enhance both constructive and improvement methods in NCO through an innovative memory module. MARCO stores data collected throughout the optimization trajectory and retrieves contextually relevant information at each state. This way, the search is guided by two competing criteria: making the best decision in terms of solution quality and avoiding revisits of already explored solutions. This approach promotes a more efficient use of the available optimization budget. Moreover, thanks to the parallel nature of NCO models, several search threads can run simultaneously, all sharing the same memory module, enabling efficient collaborative exploration. Empirical evaluations, carried out on the maximum cut, maximum independent set, and travelling salesman problems, reveal that the memory module effectively increases exploration, enabling the model to discover diverse, higher-quality solutions. MARCO achieves good performance at a low computational cost, establishing a promising new direction in the field of NCO.
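The two competing criteria can be captured in a one-line scoring rule; the max-similarity penalty over stored binary solutions below is an illustrative stand-in for MARCO's richer retrieval.

import numpy as np

def memory_penalized_score(candidate, value, memory, penalty=1.0):
    # Prefer high solution value, but discount candidates that resemble
    # solutions already stored in the shared memory.
    if not memory:
        return value
    sims = [float(np.mean(candidate == m)) for m in memory]
    return value - penalty * max(sims)

memory = [np.array([1, 0, 1, 1]), np.array([0, 0, 1, 0])]
print(memory_penalized_score(np.array([1, 0, 1, 0]), 3.0, memory))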
List of keywords
Search -> S: Combinatorial search and optimisation
Search -> S: Search and machine learning
2937
Guidance Graph Optimization for Lifelong Multi-Agent Path Finding
Yulun Zhang, He Jiang, Varun Bhatt, Stefanos Nikolaidis, Jiaoyang Li
6 min. talk | August 7th at 11:30 | Session: MAS: Agent-based and Multi-agent Systems (1/2)
We study how to use guidance to improve the throughput of lifelong Multi-Agent Path Finding (MAPF). Previous studies have demonstrated that, while incorporating guidance, such as highways, can accelerate MAPF algorithms, this often results in a trade-off with solution quality. In addition, how to generate good guidance automatically remains largely unexplored, with current methods falling short of surpassing manually designed ones. In this work, we introduce the guidance graph as a versatile representation of guidance for lifelong MAPF, framing Guidance Graph Optimization (GGO) as the task of optimizing its edge weights. We present two GGO algorithms to automatically generate guidance for arbitrary lifelong MAPF algorithms and maps. The first method directly optimizes edge weights, while the second optimizes an update model capable of generating edge weights. Empirically, we show that (1) our guidance graphs improve the throughput of three representative lifelong MAPF algorithms on eight benchmark maps, and (2) our update model can generate guidance graphs for maps as large as 93 x 91 and for as many as 3,000 agents. We include the source code at: https://github.com/lunjohnzhang/ggo_public. All optimized guidance graphs are available online at: https://yulunzhang.net/publication/zhang2024ggo.
List of keywords
Agent-based and Multi-agent Systems -> MAS: Multi-agent planning
Planning and Scheduling -> PS: Applications
Robotics -> ROB: Multi-robot systems
Search -> S: Evolutionary computation
2939
Capturing Knowledge Graphs and Rules with Octagon Embeddings
Victor Charpenay, Steven Schockaert
6 min. talk | August 8th at 10:00 | Session: KRR: Learning and reasoning
Region-based knowledge graph embeddings represent relations as geometric regions. This has the advantage that the rules captured by the model are made explicit, making it straightforward to incorporate prior knowledge and to inspect learned models. Unfortunately, existing approaches are severely restricted in their ability to model relational composition, and hence also in their ability to model rules, thus failing to deliver on the main promise of region-based models. With the aim of addressing these limitations, we investigate regions which are composed of axis-aligned octagons. Such octagons are particularly easy to work with, as intersections and compositions can be straightforwardly computed, while they are still sufficiently expressive to model arbitrary knowledge graphs. Among other results, we show that our octagon embeddings can properly capture a non-trivial class of rule bases. Finally, we show that our model achieves competitive experimental results.
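Why octagons are easy to work with can be seen from one plausible representation (our reading of "axis-aligned octagons", not code from the paper): bounds on x, y, x+y and x-y, under which intersection is just elementwise tightening.

import numpy as np

# An octagon as [x_lo, x_hi, y_lo, y_hi, s_lo, s_hi, d_lo, d_hi],
# where s = x + y and d = x - y.
def octagon_intersect(o1, o2):
    out = np.empty(8)
    out[0::2] = np.maximum(o1[0::2], o2[0::2])  # tighten lower bounds
    out[1::2] = np.minimum(o1[1::2], o2[1::2])  # tighten upper bounds
    return out

o1 = np.array([0, 4, 0, 4, 0, 6, -4, 4], dtype=float)
o2 = np.array([1, 5, -1, 3, 1, 7, -2, 5], dtype=float)
print(octagon_intersect(o1, o2))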
List of keywords
Knowledge Representation and Reasoning -> KRR: Learning and reasoning
Data Mining -> DM: Knowledge graphs and knowledge base completion
Machine Learning -> ML: Neuro-symbolic methods
2942
Enabling Mixed Effects Neural Networks for Diverse, Clustered Data Using Monte Carlo Methods
Andrej Tschalzev, Paul Nitschke, Lukas Kirchdorfer, Stefan Lüdtke, Christian Bartelt, Heiner Stuckenschmidt
12 min. talk | August 7th at 11:30 | Session: ML: Deep learning architectures
Neural networks often assume independence among input data samples, disregarding correlations arising from inherent clustering patterns in real-world datasets (e.g., due to different sites or repeated measurements). Recently, mixed effects neural networks (MENNs) which separate cluster-specific ‘random effects’ from cluster-invariant ‘fixed effects’ have been proposed to improve generalization and interpretability for clustered data. However, existing methods only allow for approximate quantification of cluster effects and are limited to regression and binary targets with only one clustering feature. We present MC-GMENN, a novel approach employing Monte Carlo techniques to train Generalized Mixed Effects Neural Networks. We empirically demonstrate that MC-GMENN outperforms existing mixed effects deep learning models in terms of generalization performance, time complexity, and quantification of inter-cluster variance. Additionally, MC-GMENN is applicable to a wide range of datasets, including multi-class classification tasks with multiple high-cardinality categorical features. For these datasets, we show that MC-GMENN outperforms conventional encoding and embedding methods, simultaneously offering a principled methodology for interpreting the effects of clustering patterns.
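The Monte Carlo ingredient is easiest to see in a one-parameter special case: marginalizing a random intercept out of a binary likelihood by sampling. The full method trains networks with multiple clustering features; this sketch shows only the integral being approximated.

import numpy as np

def mc_marginal_nll(logit_fixed, y, sigma, n_samples=1000, seed=0):
    # Estimate -log p(y | x) with p(y|x) = E_b[ Bernoulli(sigmoid(f(x)+b)) ],
    # b ~ N(0, sigma^2) a cluster-level random effect (illustrative sketch).
    b = np.random.default_rng(seed).normal(0.0, sigma, size=n_samples)
    p = 1.0 / (1.0 + np.exp(-(logit_fixed + b)))
    lik = p if y == 1 else 1.0 - p
    return float(-np.log(lik.mean() + 1e-12))

print(mc_marginal_nll(logit_fixed=0.3, y=1, sigma=1.5))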
List of keywords
Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Classification
Machine Learning -> ML: Explainable/Interpretable machine learning
Machine Learning -> ML: Probabilistic machine learning
2947
Ordinal Maximin Guarantees for Group Fair Division
Pasin Manurangsi, Warut Suksompong
12 min. talk | August 8th at 10:00 | Session: GTEP: Fair division
We investigate fairness in the allocation of indivisible items among groups of agents using the notion of maximin share (MMS). While previous work has shown that no nontrivial multiplicative MMS approximation can be guaranteed in this setting for general group sizes, we demonstrate that ordinal relaxations are much more useful. For example, we show that if n agents are distributed equally across g groups, there exists a 1-out-of-k MMS allocation for k = O(g log(n/g)), while if all but a constant number of agents are in the same group, we obtain k = O(log n / log log n). We also establish the tightness of these bounds and provide non-asymptotic results for the case of two groups.
List of keywords
Game Theory and Economic Paradigms -> GTEP: Fair division
Game Theory and Economic Paradigms -> GTEP: Computational social choice
2950
On Using Admissible Bounds for Learning Forward Search Heuristics
Carlos Núñez-Molina, Masataro Asai, Pablo Mesejo, Juan Fernandez-Olivares
6 min. talk | August 7th at 15:00 | Session: PS: Planning and Scheduling (1/2)
In recent years, there has been growing interest in utilizing modern machine learning techniques to learn heuristic functions for forward search algorithms. Despite this, there has been little theoretical understanding of what they should learn, how to train them, and why we do so. This lack of understanding has resulted in the adoption of diverse training targets (suboptimal vs. optimal costs vs. admissible heuristics) and loss functions (e.g., square vs. absolute errors) in the literature. In this work, we focus on how to effectively utilize the information provided by admissible heuristics in heuristic learning. We argue that learning from poly-time admissible heuristics by minimizing mean square errors (MSE) is not the correct approach, since the result is merely a noisy, inadmissible copy of an efficiently computable heuristic. Instead, we propose to model the learned heuristic as a truncated Gaussian, where admissible heuristics are used not as training targets but as lower bounds of this distribution. This results in a different loss function from the MSE commonly employed in the literature, which implicitly models the learned heuristic as a Gaussian distribution. We conduct experiments where both MSE and our novel loss function are applied to learning a heuristic from optimal plan costs. Results show that our proposed method converges faster during training and yields better heuristics.
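The resulting loss is the negative log-likelihood of a Gaussian truncated below at the admissible bound; up to additive constants it is a short self-contained computation (the parameterization of sigma is an assumption here):

import math

def truncated_gaussian_nll(mu, sigma, target, lower):
    # mu, sigma: predicted heuristic distribution; target: optimal cost;
    # lower: admissible heuristic used as the truncation point.
    z = (target - mu) / sigma
    alpha = (lower - mu) / sigma
    surv = max(1.0 - 0.5 * (1.0 + math.erf(alpha / math.sqrt(2))), 1e-12)
    return 0.5 * z * z + math.log(sigma) + math.log(surv)

# Unlike MSE, targets below the admissible bound carry zero likelihood,
# and predictions near the bound are reweighted by the log(surv) term.
print(truncated_gaussian_nll(mu=10.0, sigma=2.0, target=12.0, lower=9.0))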
List of keywords
Planning and Scheduling -> PS: Learning in planning and scheduling
Machine Learning -> ML: Knowledge-aided learning
Search -> S: Heuristic search
Machine Learning -> ML: Neuro-symbolic methods
2971
Towards Automatic Composition of ASP Programs from Natural Language Specifications
Manuel Borroto Santana, Irfan Kareem, Francesco Ricca
6 min. talk | August 8th at 11:30 | Session: NLP: Applications
This paper takes a first step towards automating the composition of Answer Set Programming (ASP) specifications. In particular, the following contributions are provided: (i) a dataset focused on graph-related problem specifications, designed to develop and assess tools for automatic ASP coding; (ii) a two-step architecture, implemented in the NL2ASP tool, for generating ASP programs from natural language specifications. NL2ASP uses neural machine translation to transform natural language into Controlled Natural Language (CNL) statements. Subsequently, the CNL statements are converted into ASP code using the CNL2ASP tool. An experimental analysis confirms the viability of the approach.
List of keywords
Natural Language Processing -> NLP: Applications
Knowledge Representation and Reasoning -> KRR: Logic programming
Machine Learning -> ML: Generative models
2975
ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition
Mengqi Xue, Qihan Huang, Haofei Zhang, Jingwen Hu, Jie Song, Mingli Song, Canghong Jin
6 min. talk | August 8th at 15:00 | Session: CV: Computer Vision (2/2)
Prototypical part network (ProtoPNet) and its variants have drawn wide attention and been applied to various tasks due to their inherent self-explanatory property. Previous ProtoPNets are primarily built upon convolutional neural networks (CNNs), so it is natural to investigate whether these explainable methods can be advantageous for the recently emerged Vision Transformers (ViTs). However, directly utilizing ViT backbones can lead to prototypes paying excessive attention to background positions rather than foreground objects (i.e., the “distraction” problem). To address this problem, this paper proposes the prototypical part Transformer (ProtoPFormer) for interpretable image recognition. Based on the architectural characteristics of ViTs, we modify the original ProtoPNet by creating separate global and local branches, each accompanied by corresponding prototypes that capture and highlight representative holistic and partial features. Specifically, the global prototypes guide the local prototypes to concentrate on the foreground and effectively suppress the background influence. Subsequently, the local prototypes are explicitly supervised to concentrate on different discriminative visual parts. Finally, the two branches mutually correct each other and jointly make the final decisions. Extensive experiments demonstrate that ProtoPFormer consistently achieves superior performance in accuracy, visualization results, and quantitative interpretability evaluation over state-of-the-art (SOTA) baselines. Our code has been released at https://github.com/zju-vipa/ProtoPFormer.
List of keywords
Computer Vision -> CV: Interpretability and transparency
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Representation learning
2979
Epistemic Logic Programs: Non-Ground and Counting Complexity
Thomas Eiter, Johannes K. Fichte, Markus Hecher, Stefan Woltran
6 min. talk | August 7th at 15:00 | Session: KRR: Knowledge Representation and Reasoning (1/2)
Answer Set Programming (ASP) is a prominent problem-modeling and solving framework, whose solutions are called answer sets. Epistemic logic programs (ELP) extend ASP to reason about all or some answer sets. Solutions to an ELP can be seen as consequences over multiple collections of answer sets, known as world views. While the complexity of propositional programs is well studied, the non-ground case remains open. This paper establishes the complexity of non-ground ELPs. We provide a comprehensive picture for well-known program fragments, which turn out to be complete for the class NEXPTIME with access to oracles up to Σᵖ₂. In the quantitative setting, we establish complexity results for counting complexity beyond #EXP. To mitigate the high complexity, we establish results for the case of bounded predicate arity, reaching up to the fourth level of the polynomial hierarchy. Finally, we provide ETH-tight runtime results for the parameter treewidth, which has applications in quantitative reasoning, where we reason about (marginal) probabilities of epistemic literals.
List of keywords
Knowledge Representation and Reasoning -> KRR: Computational complexity of reasoning
Knowledge Representation and Reasoning -> KRR: Knowledge representation languages
Knowledge Representation and Reasoning -> KRR: Logic programming
Knowledge Representation and Reasoning -> KRR: Non-monotonic reasoning
2980
HVOFusion: Incremental Mesh Reconstruction Using Hybrid Voxel Octree
Shaofan Liu, Junbo Chen, Jianke Zhu
6 min. talk | August 6th at 11:30 | Session: ROB: Robotics (1/2)
Incremental scene reconstruction is essential to navigation in robotics. Conventional methods typically make use of either TSDF (truncated signed distance function) volumes or neural networks to implicitly represent the surface. Due to the voxel representation or the time-consuming sampling involved, they have difficulty balancing speed, memory storage, and surface quality. In this paper, we propose a novel hybrid voxel-octree approach that effectively fuses octree with voxel structures, so that we can take advantage of both implicit surface and explicit triangular mesh representations. This sparse structure preserves triangular faces in the leaf nodes and produces partial meshes sequentially for incremental reconstruction. The storage scheme allows us to naturally optimize the mesh in explicit 3D space and thereby achieve higher surface quality. We iteratively deform the mesh towards the target and recover vertex colors by optimizing a shading model. Experimental results on several datasets show that our proposed approach is capable of quickly and accurately reconstructing a scene with realistic colors. Code is available at https://github.com/Frankuzi/HVOFusion.
List of keywords
Robotics -> ROB: Localization, mapping, state estimation
Computer Vision -> CV: 3D computer vision
Computer Vision -> CV: Applications
Robotics -> ROB: Robotics and vision
2991
Advancing Generalized Transfer Attack with Initialization Derived Bilevel Optimization and Dynamic Sequence Truncation
Yaohua Liu, Jiaxin Gao, Xuan Liu, Xianghao Jiao, Xin Fan, Risheng Liu
6 min. talk | August 7th at 11:30 | Session: CV: Adversarial learning, adversarial attack and defense methods
Transfer attacks generate significant interest for real-world black-box applications by crafting transferable adversarial examples through surrogate models. However, existing works essentially optimize a single-level objective w.r.t. the surrogate model directly, which often leads to poor interpretability of the attack mechanism and limited generalization performance over unknown victim models. In this work, we propose the BilEvel Transfer AttacK (BETAK) framework by establishing an initialization-derived bilevel optimization paradigm, which explicitly reformulates the nested constraint relationship between the Upper-Level (UL) pseudo-victim attacker and the Lower-Level (LL) surrogate attacker. Algorithmically, we introduce Hyper Gradient Response (HGR) estimation as effective feedback on transferability over pseudo-victim attackers, and propose the Dynamic Sequence Truncation (DST) technique to dynamically adjust the back-propagation path for HGR and simultaneously reduce computational overhead. Meanwhile, we conduct detailed algorithmic analysis and provide a convergence guarantee that supports the non-convexity of the LL surrogate attacker. Extensive evaluations demonstrate substantial improvements of BETAK (e.g., a 53.41% increase in attack success rate against the IncRes-v2_ens victim) against different victims and defense methods in targeted and untargeted attack scenarios.
List of keywords
Computer Vision -> CV: Adversarial learning, adversarial attack and defense methods
Computer Vision -> CV: Machine learning for vision
2993
Learning from Long-Tailed Noisy Data with Sample Selection and Balanced Loss
Lefan Zhang, Zhang-Hao Tian, Wujun Zhou, Wei Wang
6 min. talk | August 9th at 11:30 | Session: ML: Classification
The success of deep learning depends on large-scale and well-curated training data, while data in real-world applications are commonly long-tailed and noisy. Existing methods usually depend on label frequency to tackle class imbalance; however, model bias on different classes is not directly related to label frequency, and the true label frequency is inaccessible under label noise. To solve this, we propose a robust method for learning from long-tailed noisy data with sample selection and balanced loss. Specifically, we separate the noisy training data into a clean labeled set and an unlabeled set via sample selection, and train the deep neural network in a semi-supervised manner with a balanced loss based on model bias. Extensive experiments on benchmarks demonstrate that our method outperforms existing state-of-the-art methods.
List of keywords
Machine Learning -> ML: Classification
2994
Preferred Reasoning in ABA by Cycle-Breaking
Kiet Nguyen Anh, Markus Ulbricht
12 min. talk | August 6th at 15:00 | Session: KRR: Argumentation
We develop a fixed-parameter tractable (FPT) algorithm for skeptical preferred reasoning in assumption-based argumentation (ABA). To this end, we make use of so-called backdoors, i.e., sets of assumptions that need to be evaluated such that the remaining ABA framework (ABAF) belongs to a computationally beneficial subclass. In order to identify such target classes, we employ a suitable notion of a dependency graph of an ABAF. We show that these graphs can be constructed in polynomial time and that one can efficiently check sufficient properties ensuring that reasoning in the underlying ABAF is tractable. After establishing the theoretical foundations, we test our implementation against the ASPforABA solver, which convincingly won the ABA track of the ICCMA’23 competition. As it turns out, our algorithm outperforms ASPforABA on instances with small backdoor sizes.
List of keywords
Knowledge Representation and Reasoning -> KRR: Argumentation
Knowledge Representation and Reasoning -> KRR: Computational complexity of reasoning
3000
PRASS: Probabilistic Risk-averse Robust Learning with Stochastic Search
Tianle Zhang, Yanghao Zhang, Ronghui Mu, Jiaxu Liu, Jonathan Fieldsend, Wenjie Ruan
6 min. talk | August 9th at 10:00 | Session: ETF: Trustworthy AI
Deep learning models, despite their remarkable success in various tasks, have been shown to be vulnerable to adversarial perturbations. Although robust learning techniques that consider adversarial risks against worst-case perturbations can effectively increase a model’s robustness, they may not always be the most suitable approach, because in certain scenarios perturbations are more likely to occur probabilistically rather than being intentionally crafted by attackers. To address this challenge, we propose a novel risk-averse robust learning method based on entropic value-at-risk, called PRASS (Probabilistic Risk-Averse Robust Learning with Stochastic Search). Our approach leverages principles of stochastic optimisation and considers perturbing distributions rather than solely worst-case adversaries. By applying adaptive stochastic search to parameterised distributions, we further enhance the scalability of PRASS to handle distributional robustness. Empirical experiments demonstrate that PRASS outperforms existing state-of-the-art baselines.
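Entropic value-at-risk has a closed sample-based form, which makes the risk objective easy to state; the grid search over t below stands in for the paper's stochastic optimisation.

import numpy as np

def evar(losses, alpha, t_grid=np.logspace(-3, 1, 200)):
    # EVaR_alpha(X) = inf_{t>0} (1/t) * log( E[exp(t X)] / alpha ),
    # estimated from samples of the loss under the perturbation
    # distribution; log E[exp(tX)] is computed stably via log-sum-exp.
    vals = []
    for t in t_grid:
        m = (t * losses).max()
        log_mgf = m + np.log(np.mean(np.exp(t * losses - m)))
        vals.append((log_mgf - np.log(alpha)) / t)
    return float(min(vals))

losses = np.random.default_rng(3).normal(1.0, 0.5, size=10000)
print(evar(losses, alpha=0.1))  # sits between the mean and the worst case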
List of keywords
AI Ethics, Trust, Fairness -> ETF: Trustworthy AI
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
Machine Learning -> ML: Adversarial machine learning
Machine Learning -> ML: Robustness
3009
CPa-WAC: Constellation Partitioning-based Scalable Weighted Aggregation Composition for Knowledge Graph Embedding
Sudipta Modak, Aakarsh Malhotra, Sarthak Malik, Anil Surisetty, Esam Abdel-Raheem
6 min. talk | August 8th at 15:00 | Session: KRR: Knowledge Representation and Reasoning (2/2)
Scalability and training time are crucial for any graph neural network model processing a knowledge graph (KG). While partitioning knowledge graphs helps reduce the training time, the prediction accuracy drops significantly compared to training the model on the whole graph. In this paper, we propose CPa-WAC: a lightweight architecture that incorporates graph convolutional networks and modularity-maximization-based constellation partitioning to harness the power of local graph topology. The proposed CPa-WAC method reduces the training time and memory cost of knowledge graph embedding, making the learning model scalable. The results of our experiments on standard databases, such as WordNet and Freebase, show that by achieving meaningful partitioning, any knowledge graph can be broken down into subgraphs and processed separately to learn embeddings. Furthermore, these learned embeddings can be used for knowledge graph completion, retaining performance similar to training a GCN on the whole KG while speeding up the training process by up to five times. Additionally, the proposed CPa-WAC method outperforms several other state-of-the-art KG embedding methods in terms of prediction accuracy.
List of keywords
Knowledge Representation and Reasoning -> KRR: Applications
Uncertainty in AI -> UAI: Graphical models
Machine Learning -> ML: Knowledge-aided learning
Data Mining -> DM: Knowledge graphs and knowledge base completion
3020
Mechanisms That Play a Game, Not Toss a Coin
Toby Walsh
6 min. talk | August 8th at 11:30 | Session: GTEP: Mechanism design
Randomized mechanisms can have good normative properties compared to their deterministic counterparts. However, randomized mechanisms are problematic in several ways, such as in their verifiability. We propose here to de-randomize such mechanisms by having agents play a game instead of tossing a coin. The game is designed so that agents play randomly, and this play injects “randomness” into the mechanism. Surprisingly, this de-randomization retains many of the good normative properties of the original randomized mechanism but gives a mechanism that is deterministic and easy, for instance, to audit. We consider three general-purpose methods to de-randomize mechanisms and apply them to six different domains: voting, facility location, task allocation, school choice, peer selection, and resource allocation. We propose a number of novel de-randomized mechanisms for these six domains with good normative properties (such as equilibria in which agents sincerely report preferences over the original problem). In one domain, we additionally show that a new and desirable normative property emerges as a result of de-randomization.
List of keywords
Game Theory and Economic Paradigms -> GTEP: Mechanism design
Agent-based and Multi-agent Systems -> MAS: Resource allocation
Game Theory and Economic Paradigms -> GTEP: Computational social choice
Game Theory and Economic Paradigms -> GTEP: Fair division
3024
Scalable Landmark Hub Labeling for Optimal and Bounded Suboptimal Pathfinding
Sabine Storandt
6 min. talk | August 7th at 15:00 | Session: PS: Planning and Scheduling (1/2)
Hub Labeling and A* are two well-established algorithms for shortest path computation in large graphs. Hub Labeling offers excellent query times for distance computation, but at the cost of a high space consumption for label storage. Landmark-based A* search requires less space but answers queries much slower. Recently, Landmark Hub Labeling (LHL) has been proposed, which combines both concepts and achieves a smaller space consumption than Hub Labeling and also much better query times than A*. However, the known algorithms for computing a LHL do not scale to large graphs, limiting its applicability. In this paper, we devise novel algorithms for LHL construction that work on graphs with millions of edges. We also further improve the LHL query answering algorithm and investigate how to reduce the space consumption of labeling techniques by performing bounded suboptimal pathfinding. In an extensive experimental study, we demonstrate the effectiveness of our methods and illuminate that sensible trade-offs between space consumption, query time, and path quality can be achieved with LHL.
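For readers unfamiliar with the baseline, the pure Hub Labeling distance query that LHL builds on is a few lines: each vertex stores a hub-to-distance label, and a query scans for the best common hub (LHL additionally mixes in landmark-based A*, not shown here).

def hub_label_query(label_u, label_v):
    # label_u, label_v: dicts mapping hub -> distance from u (resp. v).
    best = float("inf")
    for hub, du in label_u.items():
        dv = label_v.get(hub)
        if dv is not None and du + dv < best:
            best = du + dv
    return best

label_u = {"h1": 2.0, "h2": 5.0}
label_v = {"h2": 1.0, "h3": 4.0}
print(hub_label_query(label_u, label_v))  # 6.0, via hub h2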
List of keywords
Planning and Scheduling -> PS: Routing
Multidisciplinary Topics and Applications -> MTA: Transportation
Search -> S: Applications
Search -> S: Combinatorial search and optimisation
3030
Toward Completing the Picture of Control in Schulze and Ranked Pairs Elections
Cynthia Maushagen, David Niclaus, Paul Nüsken, Jörg Rothe, Tessa Seeger
6 min. talk | August 7th at 15:00 | Session: GTEP: Computational social choice (2/2)
Both Schulze and ranked pairs are voting rules that satisfy many natural, desirable axioms. Many standard types of electoral control (with a chair seeking to change the outcome of an election by interfering with the election structure) have already been studied. However, for control by replacing candidates or voters and for (exact) multimode control that combines multiple standard attacks, many questions remain open. We solve a number of these open cases for Schulze and ranked pairs. In addition, we fix a flaw in the reduction of Menton and Singh showing that Schulze is resistant to constructive control by deleting candidates and re-establish a vulnerability result for destructive control by deleting candidates. In some of our proofs, we study variants of s-t vertex cuts in graphs that are related to our control problems.
List of keywords
Game Theory and Economic Paradigms -> GTEP: Computational social choice
3049
Attention Based Document-level Relation Extraction with None Class Ranking Loss
Xiaolong Xu, Chenbin Li, Haolong Xiang, Lianyong Qi, Xuyun Zhang, Wanchun Dou
6 min. talk | August 7th at 11:30 | Session: NLP: Information extraction
Document-level relation extraction (RE) makes it feasible to analyze global relations between entities in a text and to obtain more comprehensive and accurate semantic information. In document-level RE, the model needs to infer implicit relations between two entities located in different sentences. To obtain more semantic information, existing methods mainly focus on exploring entity representations. However, they ignore the correlations and indivisibility between relations, entities, and contexts. Furthermore, current methods only independently estimate the cases of predefined relations, ignoring the case of “no relation”, which results in poor predictions. To address the above issues, we propose a document-level RE method based on attention mechanisms that considers the “no relation” case. Specifically, our approach leverages graph attention and multi-head attention networks to capture the correlations and indivisibility among relations, entities, and contexts. In addition, a novel multi-label loss function that promotes large margins in label confidence scores between each predefined class and the none class is employed to improve prediction performance. Extensive experiments conducted on benchmark datasets demonstrate that our proposed method outperforms state-of-the-art baselines with higher accuracy.
List of keywords
Natural Language Processing -> NLP: Information extraction
Natural Language Processing -> NLP: Embeddings
3057
Delocate: Detection and Localization for Deepfake Videos with Randomly-Located Tampered Traces
Juan Hu, Xin Liao, Difei Gao, Satoshi Tsutsui, Qian Wang, Zheng Qin, Mike Zheng Shou
12 min. talk | August 8th at 10:00 | Session: MTA: Security and privacy
Deepfake videos are becoming increasingly realistic, showing few tampering traces on facial areas that vary between frames. Consequently, existing Deepfake detection methods struggle to detect unknown-domain Deepfake videos while accurately locating the tampered region. To address this limitation, we propose Delocate, a novel Deepfake detection model that can both recognize and localize unknown-domain Deepfake videos. Our method consists of two stages: recovery and localization. In the recovery stage, the model randomly masks regions of interest (ROIs) and reconstructs real faces without tampering traces, leading to a relatively good recovery effect for real faces and a poor recovery effect for fake faces. In the localization stage, the output of the recovery phase and the forgery ground truth mask serve as supervision to guide the forgery localization process. This process strategically emphasizes the recovery phase of fake faces with poor recovery, facilitating the localization of tampered regions. Our extensive experiments on four widely used benchmark datasets demonstrate that Delocate not only excels in localizing tampered areas but also enhances cross-domain detection performance.
List of keywords
Multidisciplinary Topics and Applications -> MTA: Security and privacy
Computer Vision -> CV: Biometrics, face, gesture and pose recognition
3058
Deep Multi-Dimensional Classification with Pairwise Dimension-Specific Features
Teng Huang, Bin-Bin Jia, Min-Ling Zhang
6 min. talk | August 9th at 11:30 | Session: ML: Classification
In multi-dimensional classification (MDC), each instance is associated with multiple class variables characterizing the semantics of objects from different dimensions. To consider the dependencies among class variables and the specific characteristics contained in different semantic dimensions, a novel deep MDC approach named PIST is proposed to jointly deal with the two issues via learning pairwise dimension-specific features. Specifically, PIST conducts pairwise grouping to model the dependencies between each pair of class variables, which are more reliable with limited training samples. For extracting pairwise dimension-specific features, PIST weights the feature embedding with a feature importance vector, which is learned via utilizing a global loss measurement based on intra-class and inter-class covariance. Final prediction w.r.t. each dimension is determined by combining the joint probabilities related to this dimension. Comparative studies with eleven real-world MDC data sets clearly validate the effectiveness of the proposed approach.
List of keywords
Machine Learning -> ML: Classification
Machine Learning -> ML: Multi-label learning
3083
Fast One-Stage Unsupervised Domain Adaptive Person Search
Tianxiang Cui, Huibing Wang, Jinjia Peng, Ruoxi Deng, Xianping Fu, Yang Wang
6 min. talk | August 7th at 15:00 | Session: CV: Recognition (object detection, categorization)
Unsupervised person search aims to localize a particular target person in a gallery set of scene images without annotations, which is extremely challenging due to the unexpected variations of unlabeled domains. However, most existing methods are dedicated to developing multi-stage models that adapt to domain variations while using clustering for iterative model training, which inevitably increases model complexity. To address this issue, we propose Fast One-stage Unsupervised person Search (FOUS), which integrates domain adaptation with label adaptation in a complementary, end-to-end manner without iterative clustering. To minimize the domain discrepancy, FOUS introduces an Attention-based Domain Alignment Module (ADAM) which can not only align various domains for both the detection and ReID tasks but also construct an attention mechanism to reduce the adverse impact of low-quality candidates resulting from unsupervised detection. Moreover, to avoid the redundant iterative clustering mode, FOUS adopts a prototype-guided labeling method which minimizes redundant correlation computations for partial samples and assigns noisy coarse label groups efficiently. The coarse label groups are continuously refined via a label-flexible training network with an adaptive selection strategy. With the adapted domains and labels, FOUS achieves state-of-the-art (SOTA) performance on two benchmark datasets, CUHK-SYSU and PRW. The code is available at https://github.com/whbdmu/FOUS.
List of keywords
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Image and video retrieval 
3090
Discriminative Feature Decoupling Enhancement for Speech Forgery Detection
Yijun Bei, Xing Zhou, Erteng Liu, Yang Gao, Sen Lin, Kewei Gao, Zunlei Feng
6 min. talk | August 6th at 15:00 | Session: ETF: AI Ethics, Trust, Fairness (1/2)
The emergence of AIGC has brought attention to the issue of generating realistic deceptive content. While AIGC has the potential to revolutionize content creation, it also facilitates criminal activities. Specifically, the manipulation of speech has been exploited in tele-fraud and financial fraud schemes, posing a significant threat to societal security. Current deep learning-based methods for detecting forged speech extract mixed features from the original speech, which often contain redundant information. Moreover, these methods fail to consider the distinct characteristics of human voice-specific features and the diversity of background environmental sounds. This paper introduces a framework called Discriminative fEature dEcoupling enhanceMent (DEEM) for detecting speech forgery. Initially, the framework decouples the original speech into human voice features and background sound features. Subsequently, DEEM enhances voice-specific features through temporal-dimension aggregation and improves continuity-related features in the background sound map via spectral-dimension aggregation. By employing the decoupled and enhanced features, extensive experiments demonstrate that DEEM achieves an accuracy improvement of over 5% on the FoR dataset compared to state-of-the-art methods.
List of keywords
AI Ethics, Trust, Fairness -> ETF: Safety and robustness
AI Ethics, Trust, Fairness -> ETF: AI and law, governance, regulation
AI Ethics, Trust, Fairness -> ETF: Fairness and diversity
AI Ethics, Trust, Fairness -> ETF: Societal impact of AI
3100
Rethinking Correlation Learning via Label Prior for Open Set Domain Adaptation
Zi-Xian Huang, Chuan-Xian Ren
12 min. talk | August 9th at 11:30 | Session: CV: Machine learning for vision
Open Set Domain Adaptation (OSDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain, where known classes exist across domains while unknown classes are present only in the target domain. Existing methods rely on the clustering structure to identify the unknown classes, which empirically induces a large identification error if the unknown classes are a mixture of multiple components. To break through this barrier, we formulate OSDA from the view of correlation and propose a correlation metric-based framework called Balanced Correlation Learning (BCL). BCL employs the Hilbert-Schmidt Independence Criterion (HSIC) to characterize the separation between unknown and known classes, where HSIC is reformulated as the nodes’ relation on a graph. By treating the label prior as a variable, we derive theoretical results that analytically show a sufficient condition for the desired learning direction in OSDA. Methodologically, a class-balanced HSIC is proposed to preserve domain-invariant and class-discriminative features. With the guarantee of correlation learning, an entropy-based principle can effectively identify the unknown classes via uncertainty. Extensive empirical evaluations show that BCL achieves significant performance improvements.
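For readers unfamiliar with HSIC, its empirical estimator is compact. A minimal sketch, assuming RBF kernels (the kernel choice here is an assumption; BCL reformulates HSIC as a relation between graph nodes rather than computing it this way):

import numpy as np

def rbf_kernel(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    # Empirical HSIC: trace(K H L H) / (n - 1)^2, with centering matrix H
    n = X.shape[0]
    K, L = rbf_kernel(X, sigma), rbf_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

Larger values indicate stronger statistical dependence between X and Y, which is the quantity the framework uses to characterize the separation between known and unknown classes.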
List of keywords
Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
3106
Exploring the Role of Node Diversity in Directed Graph Representation Learning
Jincheng Huang, Yujie Mo, Ping Hu, Xiaoshuang Shi, Shangbo Yuan, Zeyu Zhang, Xiaofeng Zhu
6 min. talk | August 6th at 15:00 | Session: DM: Mining graphs (2/3)
Many Directed Graph Neural Networks (DGNNs) are designed to treat all nodes in the same neighbor set (i.e., the out-neighbor set and the in-neighbor set) equally for every node, without considering node diversity in directed graphs, so they are often unable to adaptively acquire suitable information from neighbors in different directions. To alleviate this issue, in this paper we investigate a new way to consider node diversity for representation learning on directed graphs, i.e., neighbor diversity and degree diversity, and propose a new framework, NDDGNN, that adaptively assigns weights to outgoing and incoming information at the node level, as sketched below. Extensive experiments on seven real-world datasets validate the superior performance of our method compared to state-of-the-art methods on both node classification and link prediction tasks.
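A minimal sketch of node-level direction weighting (a hypothetical module; it assumes the in-neighbor and out-neighbor aggregations h_in and h_out are already computed by some message-passing layer):

import torch
import torch.nn as nn

class DirectionGate(nn.Module):
    # Per-node soft weights over incoming vs. outgoing information
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)

    def forward(self, h_in, h_out):          # both: (num_nodes, dim)
        w = torch.softmax(self.gate(torch.cat([h_in, h_out], dim=-1)), dim=-1)
        return w[..., :1] * h_in + w[..., 1:] * h_out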
List of keywords
Data Mining -> DM: Mining graphs
Machine Learning -> ML: Representation learning
Machine Learning -> ML: Semi-supervised learning
3110
With a Little Help from Language: Semantic Enhanced Visual Prototype Framework for Few-Shot Learning
Hecheng Cai, Yang Liu, Shudong Huang, Jiancheng Lv
6 min. talk | August 6th at 11:30 | Session: ML: Multi-modal learning
Few-shot learning (FSL) aims to recognize new categories given limited training samples. The core challenge is to avoid overfitting to the minimal data while ensuring good generalization to novel classes. One mainstream approach employs prototypes from visual feature extractors as classifier weights, so performance depends on the quality of the prototype. Since different categories may have similar visual features, the visual prototype has limitations: existing methods only learn a simple visual feature extractor during the pre-training stage and neglect the importance of a well-developed feature space for the prototype. We introduce the Semantic Enhanced Visual Prototype framework (SEVpro) to address this issue. SEVpro refines prototype learning from the pre-training stage onward and serves as a versatile plug-and-play framework for all prototype-based FSL methods. Specifically, we enhance prototype discriminability by transforming semantic embeddings into the visual space, aiding in separating categories with similar visual features. For novel class learning, we leverage knowledge from base classes and incorporate semantic information to elevate prototype quality further. Extensive experiments on FSL benchmarks and ablation studies demonstrate the superiority of our proposed SEVpro for FSL.
List of keywords
Machine Learning -> ML: Few-shot learning
Machine Learning -> ML: Multi-modal learning
3113
SAEIR: Sequentially Accumulated Entropy Intrinsic Reward for Cooperative Multi-Agent Reinforcement Learning with Sparse Reward
Xin He, Hongwei Ge, Yaqing Hou, Jincheng Yu
6 min. talk | August 6th at 11:30 | Session: ML: Multiagent Reinforcement Learning
Multi-agent reinforcement learning (MARL) performs well for solving complex cooperative tasks when the scenarios have well-defined dense rewards. However, many real-world multi-agent systems have sparse reward settings, which makes it difficult for MARL algorithms to successfully learn an effective strategy. To tackle this problem, we propose a novel sequentially accumulated entropy intrinsic reward named SAEIR, which utilizes the entropy of the multi-agent system as a bonus to accelerate learning. Specifically, a multi-scale hypergraph critic is proposed to obtain a high-order system state representation, which also enhances the ability to effectively evaluate the actions produced by the actor. Based on this comprehensive and compact system state representation, the orderliness of the multi-agent system can be measured to determine the highly valuable states at which entropy-based intrinsic rewards are added, leading to a highly efficient learning process. Empirical results demonstrate that our proposed method achieves state-of-the-art performance in several complex cooperative multi-agent environments with sparse reward settings.
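The entropy-as-bonus idea can be sketched in a few lines (a toy version; SAEIR’s orderliness measure is built on the hypergraph-based state representation, which is omitted here):

import numpy as np

def entropy_bonus(state_counts, beta=0.1):
    # Shannon entropy of a discretized system-state histogram
    p = state_counts / state_counts.sum()
    p = p[p > 0]
    return beta * float(-(p * np.log(p)).sum())

# shaped reward at a selected, highly valuable state:
# r_total = r_env + entropy_bonus(counts)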
List of keywords
Machine Learning -> ML: Multiagent Reinforcement Learning
Agent-based and Multi-agent Systems -> MAS: Coordination and cooperation
Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
3119
Towards Dynamic-Prompting Collaboration for Source-Free Domain Adaptation
Mengmeng Zhan, Zongqian Wu, Rongyao Hu, Ping Hu, Heng Tao Shen, Xiaofeng Zhu
12 min. talk | August 6th at 15:00 | Session: CV: Transfer, low-shot, semi- and un- supervised learning   
In domain adaptation, challenges such as data privacy constraints can impede access to source data, catalyzing the development of source-free domain adaptation (SFDA) methods. However, current approaches heavily rely on models trained on source data, posing the risk of overfitting and suboptimal generalization. This paper introduces a dynamic prompt learning paradigm that harnesses the power of large-scale vision-language models to enhance the semantic transfer of source models. Specifically, our approach fosters robust and adaptive collaboration between the source-trained model and the vision-language model, facilitating the reliable extraction of domain-specific information from unlabeled target data, while consolidating domain-invariant knowledge. Without the need for accessing source data, our method amalgamates the strengths inherent in both traditional SFDA approaches and vision-language models, formulating a collaborative framework for addressing SFDA challenges. Extensive experiments conducted on three benchmark datasets showcase the superiority of our framework over previous SOTA methods.
List of keywords
Computer Vision -> CV: Transfer, low-shot, semi- and un- supervised learning   
Computer Vision -> CV: Multimodal learning
Computer Vision -> CV: Representation learning
3125
Hard-Thresholding Meets Evolution Strategies in Reinforcement Learning
Chengqian Gao, William de Vazelhes, Hualin Zhang, Bin Gu, Zhiqiang Xu
6 min. talk | August 6th at 15:00 | Session: ML: Machine Learning (1/6)
Evolution Strategies (ES) have emerged as a competitive alternative for model-free reinforcement learning, showcasing exemplary performance in tasks like Mujoco and Atari. Notably, they shine in scenarios with imperfect reward functions, making them invaluable for real-world applications where dense reward signals may be elusive. Yet, an inherent assumption in ES—that all input features are task-relevant—poses challenges, especially when confronted with irrelevant features common in real-world problems. This work scrutinizes this limitation, particularly focusing on the Natural Evolution Strategies (NES) variant. We propose NESHT, a novel approach that integrates Hard-Thresholding (HT) with NES to champion sparsity, ensuring only pertinent features are employed. Backed by rigorous analysis and empirical tests, NESHT demonstrates its promise in mitigating the pitfalls of irrelevant features and shines in complex decision-making problems like noisy Mujoco and Atari tasks. Our code is available at https://github.com/cangcn/NES-HT.
List of keywords
Machine Learning -> ML: Evolutionary learning
Machine Learning -> ML: Feature extraction, selection and dimensionality reduction
Machine Learning -> ML: Learning sparse models
Machine Learning -> ML: Optimization
3146
SceneDiff: Generative Scene-Level Image Retrieval with Text and Sketch Using Diffusion Models
Ran Zuo, Haoxiang Hu, Xiaoming Deng, Cangjun Gao, Zhengming Zhang, Yu-Kun Lai, Cuixia Ma, Yong-Jin Liu, Hongan Wang
6 min. talk | August 6th at 11:30 | Session: CV: Video analysis and understanding   
Jointly using text and sketch for scene-level image retrieval exploits the complementarity between the two modalities to describe fine-grained scene content and retrieve the target image, which plays a pivotal role in accurate image retrieval. Existing methods directly fuse the features of sketch and text and thus suffer from the bottleneck of limited utilization of crucial semantic and structural information, leading to inaccurate matching with images. In this paper, we propose SceneDiff, a novel retrieval network that leverages a pre-trained diffusion model to establish a shared generative latent space, enabling joint latent representation learning for sketch and text features and precise alignment with the corresponding image. Specifically, we encode text, sketch and image features, and project them into the diffusion-based shared space, conditioning the denoising process on sketch and text features to generate latent fusion features, while employing the pre-trained autoencoder for latent image features. Within this space, we introduce a content-aware feature transformation module to reconcile encoded sketch and image features with the diffusion latent space’s dimensional requirements and preserve their visual content information. We then augment the representation capability of the generated latent fusion features by integrating multiple samplings with partition attention, and utilize contrastive learning to align both direct fusion features and generated latent fusion features with the corresponding image representations. Our method outperforms state-of-the-art works in extensive experiments, providing novel insight into the related retrieval field.
List of keywords
Computer Vision -> CV: Image and video retrieval 
Computer Vision -> CV: Multimodal learning
3164
Pointsoup: High-Performance and Extremely Low-Decoding-Latency Learned Geometry Codec for Large-Scale Point Cloud Scenes
Kang You, Kai Liu, Li Yu, Pan Gao, Dandan Ding
6 min. talk | August 7th at 15:00 | Session: ML: Machine Learning (4/6)
Despite considerable progress in point cloud geometry compression, effectively compressing large-scale scenes with sparse surfaces remains a challenge. Another key challenge lies in reducing decoding latency, a crucial requirement in real-world applications. In this paper, we propose Pointsoup, an efficient learning-based geometry codec that attains high performance and extremely low decoding latency simultaneously. Inspired by the conventional Trisoup codec, a point-model-based strategy is devised to characterize local surfaces. Specifically, skin features are embedded from local windows via an attention-based encoder, and dilated windows are introduced as cross-scale priors to infer the distribution of quantized features in parallel. During decoding, features undergo fast refinement, followed by a folding-based point generator that reconstructs point coordinates at a fairly fast speed. Experiments show that Pointsoup achieves state-of-the-art performance on multiple benchmarks with significantly lower decoding complexity, i.e., up to 90~160× faster than the G-PCCv23 Trisoup decoder on a comparatively low-end platform (e.g., one RTX 2080Ti). Furthermore, it offers variable-rate control with a single neural model (2.9MB), which is attractive for industrial practitioners.
List of keywords
Machine Learning -> ML: Geometric learning
Computer Vision -> CV: 3D computer vision
Multidisciplinary Topics and Applications -> MTA: Real-time systems
Robotics -> ROB: Robotics and vision
3165
ADMN: Agent-Driven Modular Network for Dynamic Parameter Sharing in Cooperative Multi-Agent Reinforcement Learning
Yang Yu, Qiyue Yin, Junge Zhang, Pei Xu, Kaiqi Huang
6 min. talk | August 7th at 15:00 | Session: MAS: Multi-agent learning
Parameter sharing is a common strategy in multi-agent reinforcement learning (MARL) to make training more efficient and scalable. However, applying parameter sharing among agents indiscriminately hinders the emergence of agent diversity and degrades the final cooperative performance. To better balance parameter sharing and agent diversity, we propose a novel Agent-Driven Modular Network (ADMN), where agents share a base network consisting of multiple specialized modules and each agent has its own routing to connect these modules. In ADMN, modules are shared among agents to improve training efficiency, while the combination of different modules brings rich diversity. The agent routing at different time steps is learned end-to-end to achieve a dynamic and adaptive balance. We also propose an information-theoretic regularization between the routing of agents and their behavior to further guarantee the identifiability of different routings. We evaluate ADMN in challenging StarCraft micromanagement games and Google Research Football games; the results demonstrate the superior performance of ADMN, particularly in larger or heterogeneous cooperative tasks.
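A minimal sketch of the shared-modules-plus-routing idea (hypothetical code; ADMN learns routing per time step, whereas this toy version uses a static per-agent router for brevity):

import torch
import torch.nn as nn

class ModularRouting(nn.Module):
    def __init__(self, dim, n_modules, n_agents):
        super().__init__()
        # one shared bank of modules for all agents
        self.bank = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_modules)])
        # each agent owns only its routing weights
        self.router = nn.Embedding(n_agents, n_modules)

    def forward(self, x, agent_id):                  # x: (B, dim), agent_id: (B,)
        w = torch.softmax(self.router(agent_id), dim=-1)          # (B, M)
        outs = torch.stack([m(x) for m in self.bank], dim=1)      # (B, M, dim)
        return (w.unsqueeze(-1) * outs).sum(dim=1)

Sharing the bank keeps the parameter count low, while distinct routings let agents behave differently.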
List of keywords
Agent-based and Multi-agent Systems -> MAS: Multi-agent learning
Machine Learning -> ML: Reinforcement learning
3180
Unlearning from Weakly Supervised Learning
Yi Tang, Yi Gao, Yong-gang Luo, Ju-Cheng Yang, Miao Xu, Min-Ling Zhang
12 min. talk | August 8th at 10:00 | Session: ML: Weakly supervised learning
Machine unlearning provides users with the right to remove their private data from a well-trained model. Existing approaches to machine unlearning mainly focus on data removal within supervised learning (SL) tasks. However, weakly supervised learning (WSL) is more applicable to real-world scenarios since collecting WSL data is less laborious than collecting fully supervised data. In this paper, we first propose a machine unlearning approach for WSL, which updates the model parameters. Motivated by the uniform distribution of untrained model predictions, we derive a formulated target that forces the model’s predictions on removed data to be indistinguishable. This encourages the model to forget its ability to recognize features of data slated for unlearning. Moreover, we employ formulated targets to transform classification unlearning into convex regression, which significantly reduces computational cost and avoids extra information storage during training. Additionally, we discuss how to design a target that ensures the model’s predictions on removed data are indistinguishable in different learning scenarios, e.g., SL or WSL. Owing to the flexibility in formulating targets, the proposed approach effectively handles the WSL problem while still excelling on SL models. Empirical studies show the superiority of the proposed approach.
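The formulated-target idea can be sketched for the SL case as follows (a minimal sketch with hypothetical names; the paper’s convex-regression formulation is more specific than this generic loss):

import torch
import torch.nn.functional as F

def unlearning_loss(logits_removed, logits_retained, targets_retained, lam=1.0):
    # push predictions on removed data toward the uniform target
    k = logits_removed.size(-1)
    uniform = torch.full_like(logits_removed, 1.0 / k)
    forget = F.mse_loss(logits_removed.softmax(dim=-1), uniform)
    # keep fitting the retained data as usual
    retain = F.cross_entropy(logits_retained, targets_retained)
    return retain + lam * forget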
List of keywords
Machine Learning -> ML: Weakly supervised learning
Machine Learning -> ML: Other
3202
A Prior-information-guided Residual Diffusion Model for Multi-modal PET Synthesis from MRI
Zaixin Ou, Caiwen Jiang, Yongsheng Pan, Yuanwang Zhang, Zhiming Cui, Dinggang Shen
6 min. talk | August 9th at 11:30 | Session: ML: Generative models
Alzheimer’s disease (AD) leads to abnormalities in various biomarkers (i.e., amyloid-β and tau proteins), which makes PET imaging (which can detect these biomarkers) essential in AD diagnosis. However, the high radiation risk of PET imaging limits the number of scans within a short period, presenting challenges to the joint multi-biomarker diagnosis of AD. In this paper, we propose a novel unified model to simultaneously synthesize multi-modal PET images from MRI, to achieve low-cost and time-efficient joint multi-biomarker diagnosis of AD. Specifically, we incorporate residual learning into the diffusion model to emphasize inter-domain differences between PET and MRI, thereby forcing each modality to maximally reconstruct its modality-specific details. Furthermore, we leverage prior information, such as age and gender, to guide the diffusion model in synthesizing PET images with semantic consistency, enhancing their diagnostic value. Additionally, we develop an intra-domain difference loss to ensure that the intra-domain differences among synthesized PET images closely match those among real PET images, promoting more accurate synthesis, especially for modality-specific information. Extensive experiments conducted on the ADNI dataset demonstrate that our method achieves superior performance both quantitatively and qualitatively compared to the state-of-the-art methods. All code for this study has been uploaded to GitHub (https://github.com/Ouzaixin/ResDM).
List of keywords
Machine Learning -> ML: Generative models
Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Multi-modal learning
Machine Learning -> ML: Supervised Learning
3203
Fine-grained Analysis of Stability and Generalization for Stochastic Bilevel Optimization
Xuelin Zhang, Hong Chen, Bin Gu, Tieliang Gong, Feng Zheng
6 min. talk | August 7th at 15:00 | Session: ML: Machine Learning (3/6)
Stochastic bilevel optimization (SBO) has recently been integrated into many machine learning paradigms, including hyperparameter optimization, meta learning, reinforcement learning, etc. Along with this wide range of applications, there have been abundant studies concerning the computational behavior of SBO. However, the generalization guarantees of SBO methods are far less understood through the lens of statistical learning theory. In this paper, we provide a systematic generalization analysis of first-order gradient-based bilevel optimization methods. Firstly, we establish quantitative connections between the on-average argument stability and the generalization gap of SBO methods. Then, we derive upper bounds on the on-average argument stability for single-timescale stochastic gradient descent (SGD) and two-timescale SGD, considering three settings: nonconvex-nonconvex (NC-NC), convex-convex (C-C), and strongly-convex-strongly-convex (SC-SC). Experimental analysis validates our theoretical findings. Compared with previous algorithmic stability analyses, our results do not require re-initialization of the inner-level parameters before each iteration and are suited to more general objective functions.
List of keywords
Machine Learning -> ML: Learning theory
3228
Boosting Diffusion Models with an Adaptive Momentum Sampler
Xiyu Wang, Anh-Dung Dinh, Daochang Liu, Chang Xu
12 min. talk | August 8th at 11:30 | Session: CV: Image and video synthesis and generation (1/2)
Diffusion probabilistic models (DPMs) have been shown to generate high-quality images without the need for delicate adversarial training. The sampling process of DPMs is mathematically similar to Stochastic Gradient Descent (SGD), with both being iteratively updated with a function increment. Building on this, we present a novel reverse sampler for DPMs in this paper, drawing inspiration from the widely-used Adam optimizer. Our proposed sampler can be readily applied to a pre-trained diffusion model, utilizing momentum mechanisms and adaptive updating to enhance the generated image’s quality. By effectively reusing update directions from early steps, our proposed sampler achieves a better balance between high-level semantics and low-level details. Additionally, this sampler is flexible and can be easily integrated into pre-trained DPMs regardless of the sampler used during training. Our experimental results on multiple benchmarks demonstrate that our proposed reverse sampler yields remarkable improvements over different baselines.
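Schematically, the sampler treats the per-step reverse increment the way Adam treats a gradient. A sketch under that analogy only (the paper’s exact update and hyperparameters may differ):

import torch

def momentum_step(x, increment, state, t, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam-style first/second moments over the reverse-diffusion increment
    m = beta1 * state.get("m", torch.zeros_like(x)) + (1 - beta1) * increment
    v = beta2 * state.get("v", torch.zeros_like(x)) + (1 - beta2) * increment ** 2
    state["m"], state["v"] = m, v
    m_hat = m / (1 - beta1 ** t)      # bias correction, as in Adam
    v_hat = v / (1 - beta2 ** t)
    return x + m_hat / (v_hat.sqrt() + eps)

Reusing the smoothed direction from early steps is what lets the sampler trade off high-level semantics against low-level detail.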
List of keywords
Computer Vision -> CV: Image and video synthesis and generation 
3230
Image Retrieval with Self-Supervised Divergence Minimization and Cross-Attention Classification
Vivek Trivedy, Longin Jan Latecki
12 min. talk | August 8th at 15:00 | Session: CV: Computer Vision (2/2)
Common approaches to image retrieval include contrastive methods and specialized loss functions such as ranking losses and entropy regularizers. We present DMCAC (Divergence Minimization with Cross-Attention Classification), a novel image retrieval method that offers a new perspective on this training paradigm. We use self-supervision with a novel divergence loss framework alongside a simple data flow adjustment that minimizes a distribution over a database directly during training. We show that jointly learning a query representation over a database is a competitive and often improved alternative to traditional contrastive methods for image retrieval. We evaluate our method across several model configurations and four datasets, achieving state-of-the-art performance in multiple settings. We also conduct a thorough set of ablations that show the robustness of our method across full vs. approximate retrieval and different hyperparameter configurations.
List of keywords
Computer Vision -> CV: Image and video retrieval 
Computer Vision -> CV: Representation learning
3244
Learning Label Dependencies for Visual Information Extraction
Minghong Yao, Liansheng Zhuang, Houqiang Li, Jiuchang Wei
6 min. talk | August 8th at 11:30 | Session: NLP: Applications
Visual Information Extraction (VIE), which aims to extract structured information from visually rich document images, has drawn much attention due to its wide applications in document understanding. However, previous methods often treat the VIE task as a sequence labeling problem and ignore the label correlations in the sequence, which may significantly degrade their performance. To address this issue, this paper proposes a novel framework that exploits label correlations to improve the performance of VIE models. Its key idea is to learn the label dependency of entities and use it to regularize the label sequence. Specifically, to capture the label dependency of entities, a label transformer is pre-trained to assign a higher likelihood to label sequences that respect the label patterns of document layouts. During the testing stage, an inference transformer predicts the label sequence by considering not only the features of each entity but also the likelihood of the label sequence evaluated by the label transformer. Our framework can be combined with existing popular VIE models such as LayoutLM and GeoLayoutLM. Extensive experiments on public datasets have demonstrated the effectiveness of our framework.
List of keywords
Natural Language Processing -> NLP: Applications
Natural Language Processing -> NLP: Information extraction
3245
Nonparametric Detection of Gerrymandering in Multiparty Plurality Elections
Dariusz Stolicki, Wojciech Słomczyński, Stanisław Szufa
6 min. talk | August 7th at 15:00 | Session: GTEP: Computational social choice (2/2)
Partisan gerrymandering, i.e., manipulation of electoral district boundaries for political advantage, is one of the major challenges to election integrity in modern-day democracies. Yet most existing methods for detecting partisan gerrymandering are narrowly tailored toward fully contested two-party elections, and fail if there are more parties or if the number of candidates per district varies. We propose a new method that applies nonparametric statistical learning to detect anomalies in the relation between (aggregate) votes and (aggregate) seats. Unlike most existing methods, we propose to learn the standard of fairness in districting from empirical data rather than assume one a priori. Finally, we test the proposed method against experimental data as well as real-life data from 17 countries employing the plurality (FPTP) system.
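The learn-the-standard-from-data idea can be sketched with a simple kernel regression of seat shares on vote shares (illustrative only; the paper’s estimator and anomaly test are more involved):

import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=0.05):
    # kernel-weighted average: learn the empirical votes -> seats relation
    w = np.exp(-((x_query[:, None] - x_train[None, :]) ** 2) / (2 * bandwidth ** 2))
    return (w * y_train).sum(axis=1) / w.sum(axis=1)

# a districting whose observed seat share falls far from the learned curve,
# relative to the residual spread, would be flagged as anomalous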
List of keywords
Game Theory and Economic Paradigms -> GTEP: Computational social choice
Multidisciplinary Topics and Applications -> MTA: Social sciences
3261
An Image-enhanced Molecular Graph Representation Learning Framework
Hongxin Xiang, Shuting Jin, Jun Xia, Man Zhou, Jianmin Wang, Li Zeng, Xiangxiang Zeng
6 min. talk | August 8th at 15:00 | Session: MTA: Bioinformatics
Extracting rich molecular representations is a crucial prerequisite for accurate drug discovery. Recent molecular representation learning methods have achieved impressive progress, but the paradigm of learning from a single modality gradually encounters the bottleneck of limited representation capability. In this work, we fully consider the rich visual information contained in 3D-conformation molecular images (i.e., texture, shadow, color and planar spatial information) and distill graph-based models for more discriminative drug discovery. Specifically, we propose an image-enhanced molecular graph representation learning framework that leverages multi-view molecular images rendered from 3D conformations to boost molecular graph representations. To extract useful auxiliary knowledge from multi-view images, we design a teacher, which is pre-trained on 2 million molecules with conformations through five meticulously designed pre-training tasks. To transfer knowledge from teacher to graph-based students, we propose an efficient cross-modal knowledge distillation strategy with a knowledge enhancer and a task enhancer. Notably, the distillation architecture of IEM can be directly integrated into existing graph-based models, and significantly improves the capabilities of these models (e.g., GIN, EdgePred, GraphMVP, MoleBERT) for molecular representation learning. In particular, GraphMVP and MoleBERT equipped with IEM achieve new state-of-the-art performance on the MoleculeNet benchmark, achieving average ROC-AUC of 73.89% and 73.81%, respectively. Code is available at https://github.com/HongxinXiang/IEM.
List of keywords
Multidisciplinary Topics and Applications -> MTA: Bioinformatics
Machine Learning -> ML: Knowledge-aided learning
Machine Learning -> ML: Self-supervised Learning
Machine Learning -> ML: Representation learning
3265
Formalisation and Evaluation of Properties for Consequentialist Machine Ethics
Raynaldio Limarga, Yang Song, Abhaya Nayak, David Rajaratnam, Maurice Pagnucco
6 min. talk | August 6th at 15:00 | Session: ETF: AI Ethics, Trust, Fairness (1/2)
As artificial intelligence (AI) technologies continue to influence our daily lives, there has been a growing need to ensure that AI-enabled decision-making systems adhere to principles expected of human decision makers. This need has given rise to the area of Machine Ethics. We formalise several ethical principles from the philosophical literature in the situation calculus framework to verify the ethical permissibility of a plan. Moreover, we propose several important properties, including some of our own that are intuitively appealing and a number derived from the social choice literature that appear relevant for evaluating the various approaches. Finally, we assess how the various situation calculus models of Machine Ethics we examine satisfy the important properties we have identified.
List of keywords
AI Ethics, Trust, Fairness -> ETF: Moral decision making
Knowledge Representation and Reasoning -> KRR: Reasoning about actions
Knowledge Representation and Reasoning -> KRR: Common-sense reasoning
Knowledge Representation and Reasoning -> KRR: Other
3276
Multi-scale Context-Aware Networks Based on Fragment Association for Human Activity Recognition
Zhiqiong Wang, Hanyu Liu, Boyang Zhao, Qi Shen, Mingzhe Li, Ningfeng Que, Mingke Yan, Junchang Xin
6 min. talk | August 7th at 15:00 | Session: HAI: Humans and AI
Sensor-based Human Activity Recognition (HAR) is a key component of many artificial intelligence applications. Although deep feature extraction techniques continue to improve, finding a balance between performance and computational efficiency remains difficult. Through an in-depth exploration of the inherent characteristics of HAR data, we propose a lightweight feature perception model comprising an internal feature extractor and a contextual feature perceiver. The model consists of two stages. The first stage is a hierarchical multi-scale feature extraction module, composed of depthwise separable convolutions and a multi-head attention mechanism, which extracts conventional features for Human Activity Recognition. After a fragment recombination operation, the features are passed into the second-stage Context-Aware module, which is based on the Retentive Transformer and optimized with the DropKey method, to efficiently extract the relationships between feature fragments and mine more valuable feature information. Importantly, this adds little complexity to the model, preventing excessive resource consumption. We conducted extensive experimental validation on multiple publicly available HAR datasets.
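The first-stage building block, a depthwise separable convolution, factorizes a standard convolution into a per-channel filter plus a 1×1 channel mixer, which is where the lightweight footprint comes from. A minimal PyTorch sketch for 1-d sensor streams (illustrative, not the authors’ implementation):

import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=5):
        super().__init__()
        # depthwise: one filter per channel (groups = in_ch)
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # pointwise: 1x1 conv mixes channels
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):                 # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))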
List of keywords
Humans and AI -> HAI: Applications
Data Mining -> DM: Networks
3281
LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory
Zicheng Liu, Li Wang, Siyuan Li, Zedong Wang, Haitao Lin, Stan Z. Li
6 min. talk | August 6th at 15:00 | Session: ML: Machine Learning (2/6)
Transformer models have been successful in various sequence processing tasks, but the self-attention mechanism’s computational cost limits its practicality for long sequences. Although there are existing attention variants that improve computational efficiency, they have a limited ability to abstract global information effectively based on their hand-crafted mixing strategies. On the other hand, state-space models (SSMs) are tailored for long sequences but cannot capture complicated local information. Therefore, their combination as a unified token mixer is a trend in recent long-sequence models. However, the linearized attention degrades performance significantly even when equipped with SSMs. To address the issue, we propose a new method called LongVQ. LongVQ uses the vector quantization (VQ) technique to compress the global abstraction into a length-fixed codebook, enabling linear-time computation of the attention matrix. This technique effectively maintains dynamic global and local patterns, which helps address the long-range dependency issue. Our experiments on the Long Range Arena benchmark, autoregressive language modeling, and image and speech classification demonstrate the effectiveness of LongVQ. Our model achieves significant improvements over other sequence models, including variants of Transformers, Convolutions, and recent State Space Models.
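The core computational trick is easy to see: attending over a length-fixed codebook instead of all tokens makes the cost linear in sequence length. A minimal sketch (the codebook learning via VQ is omitted; names are illustrative):

import torch
import torch.nn.functional as F

def codebook_attention(queries, codebook):
    # cost is O(n * m), with m = codebook size, linear in sequence length n
    scores = queries @ codebook.T / codebook.size(-1) ** 0.5   # (n, m)
    return F.softmax(scores, dim=-1) @ codebook                # (n, d)

q = torch.randn(1024, 64)     # n = 1024 tokens
c = torch.randn(32, 64)       # m = 32 codewords
out = codebook_attention(q, c)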
List of keywords
Machine Learning -> ML: Deep learning architectures
Machine Learning -> ML: Representation learning
Natural Language Processing -> NLP: Embeddings
3286
Unlearning during Learning: An Efficient Federated Machine Unlearning Method
Hanlin Gu, Gongxi Zhu, Jie Zhang, Xinyuan Zhao, Yuxing Han, Lixin Fan, Qiang Yang
6 min. talk | August 8th at 10:00 | Session: ML: Federated learning (2/2)
In recent years, Federated Learning (FL) has garnered significant attention as a distributed machine learning paradigm. To facilitate the implementation of the "right to be forgotten," the concept of federated machine unlearning (FMU) has also emerged. However, current FMU approaches often involve additional time-consuming steps and may not offer comprehensive unlearning capabilities, which renders them less practical in real FL scenarios. In this paper, we introduce FedAU, an innovative and efficient FMU framework aimed at overcoming these limitations. Specifically, FedAU incorporates a lightweight auxiliary unlearning module into the learning process and employs a straightforward linear operation to facilitate unlearning. This approach eliminates the requirement for extra time-consuming steps, rendering it well-suited for FL. Furthermore, FedAU exhibits remarkable versatility. It not only enables multiple clients to carry out unlearning tasks concurrently but also supports unlearning at various levels of granularity, including individual data samples, specific classes, and even at the client level. We conducted extensive experiments on MNIST, CIFAR10, and CIFAR100 datasets to evaluate the performance of FedAU. The results demonstrate that FedAU effectively achieves the desired unlearning effect while maintaining model accuracy.
List of keywords
Machine Learning -> ML: Federated learning
Multidisciplinary Topics and Applications -> MTA: Security and privacy
3291
A Behavior-Aware Approach for Deep Reinforcement Learning in Non-stationary Environments without Known Change Points
Zihe Liu, Jie Lu, Guangquan Zhang, Junyu Xuan
6 min. talk | August 8th at 15:00 | Session: ML: Reinforcement learning (2/2)
Deep reinforcement learning is used in various domains, but usually under the assumption that the environment has stationary conditions like transitions and state distributions. When this assumption is not met, performance suffers. For this reason, tracking continuous environmental changes and adapting to unpredictable conditions is challenging yet crucial because it ensures that systems remain reliable and flexible in practical scenarios. Our research introduces Behavior-Aware Detection and Adaptation (BADA), an innovative framework that merges environmental change detection with behavior adaptation. The key inspiration behind our method is that policies exhibit different global behaviors in changing environments. Specifically, environmental changes are identified by analyzing variations between behaviors using Wasserstein distances without manually set thresholds. The model adapts to the new environment through behavior regularization based on the extent of changes. The results of a series of experiments demonstrate better performance relative to several current algorithms. This research also indicates significant potential for tackling this long-standing challenge.
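The detection step can be sketched directly with SciPy, assuming policy behavior is summarized as one-dimensional samples (e.g., action statistics); BADA’s behavior representation and its data-driven threshold are more elaborate than this toy reference:

import numpy as np
from scipy.stats import wasserstein_distance

def changed(past_behavior, recent_behavior, reference_distances):
    # distance between behavior samples from the two windows
    d = wasserstein_distance(past_behavior, recent_behavior)
    # compare against distances observed within the stationary phase,
    # rather than a manually set threshold
    return d > np.mean(reference_distances) + 3.0 * np.std(reference_distances)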
List of keywords
Machine Learning -> ML: Reinforcement learning
3294
Welfare Loss in Connected Resource Allocation
Xiaohui Bei, Alexander Lam, Xinhang Lu, Warut Suksompong
6 min. talk | August 8th at 10:00 | Session: GTEP: Fair division
We study the allocation of indivisible goods that form an undirected graph and investigate the worst-case welfare loss when requiring that each agent must receive a connected subgraph. Our focus is on both egalitarian and utilitarian welfare. Specifically, we introduce the concept of egalitarian (resp., utilitarian) price of connectivity, which captures the worst-case ratio between the optimal egalitarian (resp., utilitarian) welfare among all allocations and that among the connected allocations. We provide tight or asymptotically tight bounds on the price of connectivity for various large classes of graphs when there are two agents, and for paths, stars and cycles in the general case. Many of our results are supplemented with algorithms which find connected allocations with a welfare guarantee corresponding to the price of connectivity.
List of keywords
Game Theory and Economic Paradigms -> GTEP: Fair division
Agent-based and Multi-agent Systems -> MAS: Resource allocation
3297
CausalNET: Unveiling Causal Structures on Event Sequences by Topology-Informed Causal Attention
Hua Zhu, Hong Huang, Kehan Yin, Zejun Fan, Hai Jin, Bang Liu
6 min. talk | August 7th at 10:00 | Session: UAI: Causality, structural causal models and causal inference
Causal discovery on event sequences holds a pivotal significance across domains such as healthcare, finance, and industrial systems. The crux of this endeavor lies in unraveling causal structures among event types, typically portrayed as directed acyclic graphs (DAGs). Nonetheless, prevailing methodologies often grapple with untenable assumptions and intricate optimization hurdles. To address these challenges, we present a novel model named CausalNET. At the heart of CausalNET is a special prediction module based on the Transformer architecture, which predicts forthcoming events by leveraging historical occurrences, with its predictive power amplified by a trainable causal graph engineered to capture causal relationships among event types. Further, to augment the predictive paradigm, we devise a causal decay matrix to encapsulate the reciprocal influence of events upon each other within the topological network. During training, we alternately refine the prediction module and fine-tune the causal graph. Comprehensive evaluation on a spectrum of real-world and synthetic datasets underscores the superior performance and scalability of CausalNET, which marks a promising step forward in the realm of causal discovery. Code and Appendix are available at https://github.com/CGCL-codes/CausalNET.
List of keywords
Uncertainty in AI -> UAI: Causality, structural causal models and causal inference
Machine Learning -> ML: Causality
3305
Personalized Heart Disease Detection via ECG Digital Twin Generation
Yaojun Hu, Jintai Chen, Lianting Hu, Dantong Li, Jiahuan Yan, Haochao Ying, Huiying Liang, Jian Wu
6 min. talk | August 9th at 11:30 | Session: MTA: Health and medicine
Heart diseases rank among the leading causes of global mortality, demonstrating a crucial need for early diagnosis and intervention. Most traditional electrocardiogram (ECG)-based automated diagnosis methods are trained at the population level, neglecting the customization of personalized ECGs to enhance individual healthcare management. A potential solution to address this limitation is to employ digital twins to simulate symptoms of diseases in real patients. In this paper, we present an innovative prospective learning approach for personalized heart disease detection, which generates digital twins of healthy individuals’ anomalous ECGs and enhances the model’s sensitivity to personalized symptoms. In our approach, a vector quantized feature separator is proposed to locate and isolate the disease-symptom and normal segments in ECG signals under ECG report guidance. The ECG digital twins can thus simulate specific heart diseases, and are used to train a personalized heart disease detection model. Experiments demonstrate that our approach not only excels in generating high-fidelity ECG signals but also improves personalized heart disease detection. Moreover, our approach ensures robust privacy protection, safeguarding patient data in model development. The code can be found at https://github.com/huyjj/LAVQ-Editor.
List of keywords
Multidisciplinary Topics and Applications -> MTA: Health and medicine
Machine Learning -> ML: Applications
Machine Learning -> ML: Generative models
3314
Label Distribution Learning from Logical Label
Yuheng Jia, Jiawei Tang, Jiahao Jiang
6 min. talk | August 9th at 11:30 | Session: ML: Multi-label learning
Label distribution learning (LDL) is an effective method to predict the label description degree (a.k.a. label distribution) of a sample. However, annotating the label distribution (LD) for training samples is extremely costly. So recent studies often first use label enhancement (LE) to generate an estimated label distribution from the logical label and then apply external LDL algorithms to the recovered label distribution to predict the label distribution for unseen samples. But this step-wise manner overlooks the possible connections between LE and LDL. Moreover, existing LE approaches may assign some description degrees to invalid labels. To solve the above problems, we propose a novel method to learn an LDL model directly from the logical label, which unifies LE and LDL into a joint model and avoids the drawbacks of previous LE methods. We also give the generalization error bound of our method and theoretically prove that directly learning an LDL model from logical labels is feasible. Extensive experiments on various datasets prove that the proposed approach can construct a reliable LDL model directly from the logical label, and produce more accurate label distributions than state-of-the-art LE methods. The code and the supplementary file can be found at https://github.com/seutjw/DLDL.
List of keywords
Machine Learning -> ML: Multi-label learning
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Machine Learning -> ML: Optimization
3319
Optimizing Prosumer Policies in Periodic Double Auctions Inspired by Equilibrium Analysis
Bharat Manvi, Sanjay Chandlekar, Easwar Subramanian
6 min. talk | August 6th at 15:00 | Session: GTEP: Noncooperative games
We consider a periodic double auction (PDA) wherein the main participants are wholesale suppliers and brokers representing retailers. The suppliers are represented by a composite supply curve and the brokers are represented by individual bids. Additionally, the brokers can also participate in small-scale selling by placing individual asks; hence, they act as prosumers. Specifically, in a PDA, the prosumers who are net buyers have multiple opportunities to buy or sell multiple units of a commodity with the aim of minimising the cost of buying across multiple rounds of the PDA. Formulating optimal bidding strategies for such a PDA setting involves planning across current and future rounds while taking into account the bidding strategies of other agents. In this work, we propose Markov perfect Nash equilibrium (MPNE) policies for a setup where multiple prosumers with knowledge of the composite supply curve compete to procure commodities. Thereafter, the MPNE policies are used to develop an algorithm called MPNE-BBS for the case wherein the prosumers need to reconstruct an approximate composite supply curve using past auction information. The efficacy of the proposed algorithm is demonstrated on the PowerTAC wholesale market simulator against several baselines and state-of-the-art bidding policies.
List of keywords
Game Theory and Economic Paradigms -> GTEP: Auctions and market-based systems
Agent-based and Multi-agent Systems -> MAS: Agent theories and models
Agent-based and Multi-agent Systems -> MAS: Agent-based simulation and emergence
Game Theory and Economic Paradigms -> GTEP: Noncooperative games
3333
Dynamic Weighted Graph Fusion for Deep Multi-View Clustering
Yazhou Ren, Jingyu Pu, Chenhang Cui, Yan Zheng, Xinyue Chen, Xiaorong Pu, Lifang He
6 min. talk | August 6th at 15:00 | Session: ML: Multi-view learning
By exploring complex graph information hidden in data from multiple views, multi-view clustering (MVC) based on graph neural networks significantly enhances clustering performance and has drawn increasing attention in recent years. Although considerable progress has been made, most existing GNN-based MVC models merely consider the explicit graph structure present in raw data and ignore that the latent graphs of different views also provide specific information for the clustering task. We propose dynamic weighted graph fusion for deep multi-view clustering (DFMVC) to address this issue. Specifically, DFMVC learns embedded features via deep autoencoders and then constructs latent graphs for each individual view. It then concatenates the embedded features of all views to form a global feature that leverages complementary information, and generates a fusion graph by combining all latent graphs to accurately capture the topological information among samples. Based on the informative fusion graph and global features, a graph convolution module is adopted to derive a representation with globally comprehensive information, which is further used to generate pseudo-label information. In a self-supervised manner, such information guides each view to dynamically learn discriminative features and latent graphs. Extensive experimental results demonstrate the efficacy of DFMVC.
List of keywords
Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering
3339
Dual Semantic Fusion Hashing for Multi-Label Cross-Modal Retrieval
Kaiming Liu, Yunhong Gong, Yu Cao, Zhenwen Ren, Dezhong Peng, Yuan Sun
6 min. talk | August 6th at 11:30 | Session: ML: Multi-modal learning
Cross-modal hashing (CMH) has been widely used for multi-modal retrieval tasks due to its low storage cost and fast query speed. Although existing CMH methods achieve promising performance, most of them mainly rely on coarse-grained supervision information (i.e., a pairwise similarity matrix) to measure the semantic similarities between all instances, ignoring the impact of the multi-label distribution. To address this issue, we construct fine-grained semantic similarity to explore the cluster-level semantic relationships between multi-label data, and propose a new dual semantic fusion hashing (DSFH) method for multi-label cross-modal retrieval. Specifically, we first learn the modal-specific representation and consensus hash codes, thereby merging specificity with consistency. Then, we fuse the coarse-grained and fine-grained semantics to mine multi-level semantic relationships, thereby enhancing hash code discrimination. Extensive experiments on three benchmarks demonstrate the superior performance of our DSFH compared with 16 state-of-the-art methods.
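The coarse- vs. fine-grained distinction is easy to see on multi-label annotations (a schematic only, not DSFH’s exact cluster-level construction):

import numpy as np

def similarity_matrices(labels):          # labels: (n, c) binary multi-label matrix
    # coarse: two instances are similar iff they share any label
    coarse = (labels @ labels.T > 0).astype(float)
    # finer: cosine similarity of full label vectors, a graded relation
    norms = np.linalg.norm(labels, axis=1, keepdims=True)
    fine = (labels / norms) @ (labels / norms).T
    return coarse, fine

L = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1]], dtype=float)
coarse, fine = similarity_matrices(L)     # coarse is 0/1; fine is graded

The binary matrix collapses instances sharing one label with those sharing all labels, which is exactly the information a fine-grained similarity restores.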
List of keywords
Machine Learning -> ML: Multi-modal learning
Machine Learning -> ML: Multi-view learning
3347
GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension
Jiafeng Liang, Shixin Jiang, Zekun Wang, Haojie Pan, Zerui Chen, Zheng Chu, Ming Liu, Ruiji Fu, Zhongyuan Wang, Bing Qin
6 min. talk | August 6th at 11:30 | Session: CV: Video analysis and understanding   
There are substantial instructional videos on the Internet, providing tutorials for completing various tasks. Existing instructional video datasets only focus on specific steps at the video level, lacking experiential guidelines at the task level, which can leave beginners struggling to learn new tasks for want of relevant experience. Moreover, specific steps without guidelines are trivial and unsystematic, making it difficult to provide a clear tutorial. To address these problems, we present the GUIDE (Guideline-Guided) dataset, which contains 3.5K videos of 560 instructional tasks in 8 domains related to daily life. Specifically, we annotate each instructional task with a guideline, representing a common pattern shared by all task-related videos. On this basis, we annotate systematic specific steps, including their associated guideline steps, specific step descriptions and timestamps. Our proposed benchmark consists of three sub-tasks to evaluate the comprehension ability of models: (1) Step Captioning: models have to generate captions for specific steps from videos. (2) Guideline Summarization: models have to mine the common pattern in task-related videos and summarize a guideline from them. (3) Guideline-Guided Captioning: models have to generate captions for specific steps under the guidance of the guideline. We evaluate a variety of foundation models on GUIDE and perform in-depth analysis. Given the diversity and practicality of GUIDE, we believe that it can serve as a better benchmark for instructional video comprehension.
List of keywords
Computer Vision -> CV: Video analysis and understanding   
Natural Language Processing -> NLP: Resources and evaluation
3355
DiffStega: Towards Universal Training-Free Coverless Image Steganography with Diffusion Models
Yiwei Yang, Zheyuan Liu, Jun Jia, Zhongpai Gao, Yunhao Li, Wei Sun, Xiaohong Liu, Guangtao Zhai
6 min. talk | August 9th at 11:30 | Session: CV: Machine learning for vision
Traditional image steganography focuses on concealing one image within another, aiming to avoid steganalysis by unauthorized entities. Coverless image steganography (CIS) enhances imperceptibility by not using any cover image. Recent works have utilized text prompts as keys in CIS through diffusion models. However, this approach faces three challenges: it is invalidated once the private prompt is guessed, crafting public prompts for semantic diversity is difficult, and prompts risk leakage during frequent transmission. To address these issues, we propose DiffStega, an innovative training-free diffusion-based CIS strategy for universal application. DiffStega uses a password-dependent reference image as an image prompt alongside the text, ensuring that only authorized parties can retrieve the hidden information. Furthermore, we develop a Noise Flip technique to further secure the steganography against unauthorized decryption. To comprehensively assess our method across general CIS tasks, we create a dataset comprising various image steganography instances. Experiments indicate substantial improvements of our method over existing ones, particularly in versatility, password sensitivity, and recovery quality. Codes are available at https://github.com/evtricks/DiffStega.
List of keywords
Computer Vision -> CV: Machine learning for vision
Computer Vision -> CV: Applications
Computer Vision -> CV: Structural and model-based approaches, knowledge representation and reasoning
3356
Computational Complexity of Verifying the Group No-show Paradox
Farhad Mohsin, Qishen Han, Sikai Ruan, Pin-Yu Chen, Francesca Rossi, Lirong Xia
6 min. talk | August 7th at 15:00 | Session: GTEP: Computational social choice (2/2)
The (group) no-show paradox refers to the undesirable situation where a group of agents has an incentive to abstain from voting to make the winner more favorable to them. To understand whether it is a critical concern in practice, in this paper we take a computational approach by examining the complexity of verifying whether the group no-show paradox exists given agents’ preferences and the voting rule. We prove that, unfortunately, the verification problem is NP-hard for some commonly studied voting rules, i.e., Copeland, maximin, single transferable vote, and all Condorcetified positional scoring rules such as Black’s rule. We propose integer linear programming-based algorithms and a search-based algorithm for the verification problem under different voting rules. Experimental results on synthetic data illustrate that the former are efficient when the number of unique rankings in a profile is not too high, and the latter is efficient for a small number of agents. With the help of these algorithms, we observe that group no-show paradoxes rarely occur in real-world data.
List of keywords
Game Theory and Economic Paradigms -> GTEP: Computational social choice
3359
KG-CoT: Chain-of-Thought Prompting of Large Language Models over Knowledge Graphs for Knowledge-Aware Question Answering
Ruilin Zhao, Feng Zhao, Long Wang, Xianzhi Wang, Guandong Xu
6 min. talk | August 6th at 15:00 | Session: NLP: Natural Language Processing (1/3)
Large language models (LLMs) encounter challenges such as hallucination and factual errors in knowledge-intensive tasks. On the one hand, LLMs sometimes struggle to generate reliable answers based on black-box parametric knowledge, due to the lack of responsible knowledge. On the other hand, fragmented knowledge facts extracted by knowledge retrievers fail to provide explicit and coherent reasoning paths for improving LLM reasoning. To address these challenges, we propose KG-CoT, a novel knowledge-augmented paradigm that leverages a small-scale step-by-step graph reasoning model to reason over knowledge graphs (KGs) and utilizes a reasoning path generation method to generate high-confidence chains of reasoning for large-scale LLMs. Extensive experiments demonstrate that KG-CoT significantly improves the performance of LLMs on knowledge-intensive question answering tasks, such as multi-hop, single-hop, and open-domain question answering benchmarks, without fine-tuning the LLMs. KG-CoT outperforms CoT prompting as well as prior retrieval-augmented and knowledge base question answering baselines. Moreover, KG-CoT reduces the number of API calls and the cost, and generalizes to various LLM backbones in a lightweight plug-and-play manner.
List of keywords
Natural Language Processing -> NLP: Question answering
Natural Language Processing -> NLP: Language generation
3360
Global Optimality of Single-Timescale Actor-Critic under Continuous State-Action Space: A Study on Linear Quadratic Regulator
Xuyang Chen, Jingliang Duan, Lin Zhao
6 min. talk | August 7th at 15:00 | Session: ML: Reinforcement learning (1/2)
Actor-critic methods have achieved state-of-the-art performance in various challenging tasks. However, theoretical understanding of their performance remains elusive and challenging. Existing studies mostly focus on practically uncommon variants such as double-loop or two-timescale stepsize actor-critic algorithms for simplicity. These results certify local convergence on finite state or action spaces only. We push the boundary by investigating the classic single-sample, single-timescale actor-critic on a continuous (infinite) state-action space, where we employ the canonical linear quadratic regulator (LQR) problem as a case study. We show that the popular single-timescale actor-critic can attain an ε-optimal solution with a sample complexity of order ε^(-2) for solving LQR on the demanding continuous state-action space. Our work provides new insights into the performance of the single-timescale actor-critic, further bridging the gap between theory and practice.
List of keywords
Machine Learning -> ML: Reinforcement learning
Machine Learning -> ML: Learning theory
3364
Learning to Solve Geometry Problems via Simulating Human Dual-Reasoning Process
Tong Xiao, Jiayu Liu, Zhenya Huang, Jinze Wu, Jing Sha, Shijin Wang, Enhong Chen
6 min. talk | August 9th at 11:30 | Session: NLP: Natural Language Processing (3/3)
Geometry Problem Solving (GPS), which is a classic and challenging math problem, has attracted much attention in recent years. It requires a solver to comprehensively understand both text and diagram, master essential geometry knowledge, and appropriately apply it in reasoning. However, existing works follow the paradigm of neural machine translation and focus only on enhancing the capability of encoders, which neglects essential characteristics of human geometry reasoning. In this paper, inspired by dual-process theory, we propose a Dual-Reasoning Geometry Solver (DualGeoSolver) to simulate the dual-reasoning process of humans for GPS. Specifically, we construct two systems in DualGeoSolver, namely Knowledge System and Inference System. Knowledge System controls an implicit reasoning process, which is responsible for providing diagram information and geometry knowledge according to a step-wise reasoning goal generated by Inference System. Inference System conducts an explicit reasoning process, which specifies the goal in each reasoning step and applies the knowledge to generate program tokens for resolving it. The two systems carry out the above process iteratively, which behaves more in line with human cognition. We conduct extensive experiments on two benchmark datasets, GeoQA and GeoQA+. The results demonstrate the superiority of DualGeoSolver in both solving accuracy and robustness, owing to its explicit modeling of the human reasoning process and knowledge application.
List of keywords
Natural Language Processing -> NLP: Question answering
Knowledge Representation and Reasoning -> KRR: Learning and reasoning
3379
BlockEcho: Retaining Long-Range Dependencies for Imputing Block-Wise Missing Data
Qiao Han, Mingqian Li, Yao Yang, Yiteng Zhai
6 min. talk | August 9th at 11:30 | Session: ML: Generative models
Block-wise missing data poses significant challenges in real-world data imputation tasks. Compared to scattered missing data, block-wise gaps exacerbate adverse effects on subsequent analytic and machine learning tasks, as the lack of local neighboring elements significantly reduces interpolation capability and predictive power. However, this issue has not received adequate attention. Most SOTA matrix completion methods appear less effective, primarily due to overreliance on neighboring elements for predictions. We systematically analyze the issue and propose a novel matrix completion method, "BlockEcho", for a more comprehensive solution. This method creatively integrates Matrix Factorization (MF) within Generative Adversarial Networks (GAN) to explicitly retain long-distance inter-element relationships in the original matrix. Besides, we incorporate an additional discriminator for the GAN, comparing the generator's intermediate progress with pre-trained MF results to constrain high-order feature distributions. Subsequently, we evaluate BlockEcho on public datasets across three domains. Results demonstrate superior performance over both traditional and SOTA methods when imputing block-wise missing data, especially at higher missing rates. The advantage also holds for scattered missing data at high missing rates. We also contribute analyses providing theoretical justification for the optimality and convergence of fusing MF and GAN for block-wise missing data.
List of keywords
Machine Learning -> ML: Generative models
Data Mining -> DM: Other
3383
A Grassmannian Manifold Self-Attention Network for Signal Classification
Rui Wang, Chen Hu, Ziheng Chen, Xiao-Jun Wu, Xiaoning Song
6 min. talk | August 7th at 15:00 | Session: ML: Machine Learning (4/6)
In the community of artificial intelligence, significant progress has been made in encoding sequential data using deep learning techniques. Nevertheless, how to effectively mine useful information from channel dimensions remains a major challenge, as these features have a submanifold structure. The linear subspace, the basic element of the Grassmannian manifold, has proven to be an effective manifold-valued feature descriptor in statistical representation. Besides, the Euclidean self-attention mechanism has shown great success in capturing long-range relationships of data. Inspired by these facts, we extend the self-attention mechanism to the Grassmannian manifold. Our framework can effectively characterize the spatiotemporal fluctuations of sequential data encoded in the Grassmannian manifold. Extensive experimental results on three benchmark datasets (a drone recognition dataset and two EEG signal classification datasets) demonstrate the superiority of our method over the state-of-the-art. The code and supplementary material for this work can be found at https://github.com/ChenHu-ML/GDLNet.
List of keywords
Machine Learning -> ML: Attention models
Machine Learning -> ML: Classification
Machine Learning -> ML: Geometric learning
3384
Fair Distribution of Delivery Orders
Hadi Hosseini, Shivika Narang, Tomasz Wąs
6 min. talk | August 8th at 10:00 | Session: GTEP: Fair division
We initiate the study of fair distribution of delivery tasks among a set of agents wherein delivery jobs are placed along the vertices of a graph. Our goal is to fairly distribute delivery costs (modeled as a submodular function) among a fixed set of agents while satisfying some desirable notions of economic efficiency. We adapt well-established fairness concepts—such as envy-freeness up to one item (EF1) and minimax share (MMS)—to our setting and show that fairness is often incompatible with the efficiency notion of social optimality. Yet, we characterize instances that admit fair and socially optimal solutions by exploiting graph structures. We further show that achieving fairness along with Pareto optimality is computationally intractable. Nonetheless, we design an XP algorithm (parameterized by the number of agents) for finding MMS and Pareto optimal solutions on every tree instance, and show that the same algorithm can be modified to find efficient solutions along with EF1, when such solutions exist. We complement these results by theoretically and experimentally analyzing the price of fairness.
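For readers unfamiliar with the notions, the standard cost-setting forms of the two concepts (our notation; the paper adapts them to its submodular delivery costs) are

    \[
    \text{EF1:}\;\; \forall\, i, j \;\; \exists\, o \in A_i :\;
        c_i(A_i \setminus \{o\}) \le c_i(A_j),
    \qquad
    \text{MMS}_i = \min_{(P_1, \dots, P_n)} \max_{k}\; c_i(P_k),
    \]

where \(c_i\) is agent \(i\)'s cost function and an allocation \((A_1, \dots, A_n)\) satisfies MMS if \(c_i(A_i) \le \text{MMS}_i\) for every agent \(i\).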
List of keywords
Game Theory and Economic Paradigms -> GTEP: Fair division
Game Theory and Economic Paradigms -> GTEP: Computational social choice
3391
Unified View Imputation and Feature Selection Learning for Incomplete Multi-view Data
Yanyong Huang, Zongxin Shen, Tianrui Li, Fengmao Lv
6 min. talk | August 8th at 15:00 | Session: ML: Machine Learning (6/6)
Although multi-view unsupervised feature selection (MUFS) is an effective technology for reducing dimensionality in machine learning, existing methods cannot directly deal with incomplete multi-view data where some samples are missing in certain views. These methods must first impute the missing data with predetermined values and then perform feature selection on the completed dataset. Separating the imputation and feature selection processes fails to capitalize on the potential synergy where local structural information gleaned from feature selection could guide the imputation, thereby improving the feature selection performance in turn. Additionally, previous methods only focus on leveraging samples' local structure information, while ignoring the intrinsic locality of the feature space. To tackle these problems, a novel MUFS method, called UNified view Imputation and Feature selectIon lEaRning (UNIFIER), is proposed. UNIFIER explores the local structure of multi-view data by adaptively learning similarity-induced graphs from both the sample and feature spaces. Then, UNIFIER dynamically recovers the missing views, guided by the sample and feature similarity graphs during the feature selection procedure. Furthermore, the half-quadratic minimization technique is used to automatically weight different instances, alleviating the impact of outliers and unreliable restored data. Comprehensive experimental results demonstrate that UNIFIER outperforms other state-of-the-art methods.
List of keywords
Machine Learning -> ML: Feature extraction, selection and dimensionality reduction
Machine Learning -> ML: Unsupervised learning
3406
SaSDim: Self-Adaptive Noise Scaling Diffusion Model for Spatial Time Series Imputation
Shunyang Zhang, Senzhang Wang, Xianzhen Tan, Renzhi Wang, Ruochen Liu, Jian Zhang, Jianxin Wang
6 min. talk | August 7th at 10:00 | Session: DM: Mining spatial and/or temporal data (1/2)
Spatial time series imputation is of great importance to various real-world applications. As the state-of-the-art generative models, diffusion models (e.g., CSDI) have outperformed statistical and autoregressive models in time series imputation. However, diffusion models may introduce unstable noise owing to the inherent uncertainty in sampling, causing the generated noise to deviate from the intended Gaussian distribution. Consequently, the imputed data may deviate from the real data. To this end, we propose a Self-adaptive noise Scaling Diffusion Model named SaSDim for spatial time series imputation. Specifically, we introduce a novel Probabilistic High-Order SDE Solver Module to scale the noise to follow the standard Gaussian distribution. The noise scaling operation helps the noise prediction module of the diffusion model estimate the variance of the noise more accurately. To effectively learn the spatial and temporal features, we also propose a Spatial-guided Global Convolution Module (SgGConv), which captures multi-periodic temporal dependencies via the Fast Fourier Transform and dynamic spatial dependencies via dynamic graph convolution. Extensive experiments conducted on three real-world spatial time series datasets verify the effectiveness of SaSDim.
List of keywords
Data Mining -> DM: Mining spatial and/or temporal data
3412
vMFER: Von Mises-Fisher Experience Resampling Based on Uncertainty of Gradient Directions for Policy Improvement
Yiwen Zhu, Jinyi Liu, Wenya Wei, Qianyi Fu, Yujing Hu, Zhou Fang, Bo An, Jianye Hao, Tangjie Lv, Changjie Fan
6 min. talk | August 8th at 15:00 | Session: ML: Reinforcement learning (2/2)
Reinforcement Learning (RL) is a widely employed technique in decision-making problems, encompassing two fundamental operations — policy evaluation and policy improvement. Enhancing learning efficiency remains a key challenge in RL, with many efforts focused on using ensemble critics to boost policy evaluation efficiency. However, when using multiple critics, the actor in the policy improvement process can obtain different gradients. Previous studies have combined these gradients without considering their disagreements. Therefore, optimizing the policy improvement process is crucial to enhance learning efficiency. This study focuses on investigating the impact of gradient disagreements caused by ensemble critics on policy improvement. We introduce the concept of uncertainty of gradient directions as a means to measure the disagreement among gradients utilized in the policy improvement process. Through measuring the disagreement among gradients, we find that transitions with lower uncertainty of gradient directions are more reliable in the policy improvement process. Building on this analysis, we propose a method called von Mises-Fisher Experience Resampling (vMFER), which optimizes the policy improvement process by resampling transitions and assigning higher confidence to transitions with lower uncertainty of gradient directions. Our experiments demonstrate that vMFER significantly outperforms the benchmark and is particularly well-suited for ensemble structures in RL.
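The agreement measure lends itself to a short sketch: the mean resultant length of the normalized per-critic gradients (the sufficient statistic for the concentration of a fitted von Mises-Fisher distribution) serves as a certainty score. This is a hedged illustration, not the authors' code:

    import numpy as np

    def direction_certainty(grads):
        # grads: (n_critics, dim) policy gradients for ONE transition.
        units = grads / (np.linalg.norm(grads, axis=1, keepdims=True) + 1e-8)
        # Mean resultant length in [0, 1]; 1 means all critics agree.
        return np.linalg.norm(units.mean(axis=0))

    def resample_probs(batch_grads):
        # batch_grads: (batch, n_critics, dim); more agreement -> higher prob.
        r = np.array([direction_certainty(g) for g in batch_grads])
        return r / r.sum()

    batch = np.random.randn(64, 5, 128)              # toy per-critic gradients
    p = resample_probs(batch)
    resampled = np.random.choice(64, size=64, p=p)   # favoured transitions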
List of keywords
Machine Learning -> ML: Reinforcement learning
3415
Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval
Xiaobo Shen, Qianxin Huang, Long Lan, Yuhui Zheng
6 min. talk | August 7th at 10:00 | Session: CV: Image and video retrieval 
As video-based social networks continue to grow exponentially, there is a rising interest in video retrieval using natural language. Cross-modal hashing, which learns compact hash codes for encoding multi-modal data, has proven to be widely effective in large-scale cross-modal retrieval, e.g., image-text retrieval, primarily due to its computation and storage efficiency. However, when applied to video-text retrieval, existing cross-modal hashing methods generally extract features at the frame- or word-level for videos and texts individually, thereby ignoring their long-term dependencies. To address this issue, we propose Contrastive Transformer Cross-Modal Hashing (CTCH), a novel approach designed for the video-text retrieval task. CTCH employs a bidirectional transformer encoder to encode videos and texts and leverages their long-term dependencies. CTCH further introduces a supervised multi-modality contrastive loss that effectively exploits inter-modality and intra-modality similarities among videos and texts. The experimental results on three video benchmark datasets demonstrate that CTCH outperforms the state-of-the-art methods in video-text retrieval tasks.
List of keywords
Computer Vision -> CV: Image and video retrieval 
Machine Learning -> ML: Multi-modal learning
Machine Learning -> ML: Multi-view learning
3435
Unsupervised Deep Graph Structure and Embedding Learning
Xiaobo Shen, Lei Shi, Xiuwen Gong, Shirui Pan
6 min. talk | August 9th at 11:30 | Session: DM: Mining graphs (3/3)
Graph Neural Networks (GNNs) are powerful in graph embedding learning, but their performance has been shown to degrade heavily under adversarial attacks. Deep graph structure learning (GSL) has been proposed to defend against attacks by jointly learning the graph structure and graph embeddings, typically for node classification. Label supervision is expensive in real-world applications, and thus unsupervised GSL is more challenging and remains less studied. To fill this gap, this paper proposes a new unsupervised GSL method, i.e., unsupervised property GNN (UPGNN). UPGNN first refines the graph structure by exploring the properties of low rank, sparsity, and feature smoothness. UPGNN then employs a graph mutual information loss to learn graph embeddings by maximizing their correlation with the refined graph. The proposed UPGNN learns graph structure and embeddings without label supervision, and thus can be applied to various downstream tasks. We further propose Accelerated UPGNN (AUPGNN) to reduce computational complexity, providing an efficient alternative to UPGNN. Our extensive experiments on node classification and clustering demonstrate the effectiveness of the proposed method over the state-of-the-art, especially under heavy perturbation.
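As a rough illustration of what property-based structure refinement can look like, here is a hedged sketch under our own assumptions, not UPGNN's implementation; the weights lam_* are illustrative:

    import torch

    def structure_loss(S, X, lam_nuc=1.0, lam_l1=0.1, lam_sm=0.5):
        # S: (n, n) learnable graph adjacency; X: (n, d) node features.
        nuc = torch.linalg.matrix_norm(S, ord='nuc')     # low-rank prior
        l1 = S.abs().sum()                               # sparsity prior
        lap = torch.diag(S.sum(dim=1)) - S               # graph Laplacian
        smooth = torch.trace(X.T @ lap @ X)              # feature smoothness
        return lam_nuc * nuc + lam_l1 * l1 + lam_sm * smooth

    S = torch.rand(50, 50, requires_grad=True)
    X = torch.randn(50, 16)
    structure_loss(S, X).backward()   # gradients refine the structure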
List of keywords
Data Mining -> DM: Mining graphs
Machine Learning -> ML: Sequence and graph learning
Machine Learning -> ML: Unsupervised learning
3450
Group-Aware Coordination Graph for Multi-Agent Reinforcement Learning
Wei Duan, Jie Lu, Junyu Xuan
6 min. talk | August 6th at 11:30 | Session: ML: Multiagent Reinforcement Learning
Cooperative Multi-Agent Reinforcement Learning (MARL) necessitates seamless collaboration among agents, often represented by an underlying relation graph. Existing methods for learning this graph primarily focus on agent-pair relations, neglecting higher-order relationships. While several approaches attempt to extend cooperation modelling to encompass behaviour similarities within groups, they commonly fall short in concurrently learning the latent graph, thereby constraining the information exchange among partially observed agents. To overcome these limitations, we present a novel approach to infer the Group-Aware Coordination Graph (GACG), which is designed to capture both the cooperation between agent pairs based on current observations and group-level dependencies from behaviour patterns observed across trajectories. This graph is further used in graph convolution for information exchange between agents during decision-making. To further ensure behavioural consistency among agents within the same group, we introduce a group distance loss, which promotes group cohesion and encourages specialization between groups. Our evaluations, conducted on StarCraft II micromanagement tasks, demonstrate GACG’s superior performance. An ablation study further provides experimental evidence of the effectiveness of each component of our method.
List of keywords
Machine Learning -> ML: Multiagent Reinforcement Learning
Agent-based and Multi-agent Systems -> MAS: Coordination and cooperation
3452
Make Bricks with a Little Straw: Large-Scale Spatio-Temporal Graph Learning with Restricted GPU-Memory Capacity
Binwu Wang, Pengkun Wang, Zhengyang Zhou, Zhe Zhao, Wei Xu, Yang Wang
6 min. talk | August 8th at 10:00 | Session: DM: Mining spatial and/or temporal data (2/2)
Traffic prediction plays a key role in various smart city applications, which can help traffic managers make traffic plans in advance, assist online ride-hailing companies in deploying vehicles reasonably, and provide early warning of congestion for safety authorities. While increasingly complex models achieve impressive prediction performance, there are concerns about the effectiveness of these models in handling large-scale road networks. Especially for researchers who don't have access to powerful GPU devices, the expensive memory burden limits the usefulness of these models. In this paper, we take the first step of learning on the large-scale spatio-temporal graph and propose a divide-and-conquer training strategy for Large Spatio-Temporal Graph Learning, namely LarSTL. The core idea behind this strategy is to divide the large graph into multiple subgraphs, which are treated as a task stream along which the model is trained sequentially, conquering each subgraph one by one. We introduce a novel perspective based on the continual learning paradigm to achieve this goal. To avoid forgetting the knowledge learned from previous subgraphs, an experience-replay strategy consolidates the learned knowledge by replaying nodes sampled from previous subgraphs. Moreover, we configure specific feature adaptors for each subgraph to extract personalized features, which also helps consolidate the learned knowledge from the perspective of parameters. We conduct experiments using multiple large-scale traffic network datasets on a V100 GPU with only 16GB memory, and the results demonstrate that our LarSTL can achieve competitive performance and high efficiency.
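The divide-and-conquer idea can be sketched schematically as follows; this is a toy stand-in (node-ID batches, a print in place of a real training step), none of it LarSTL's API:

    import random

    def train_on_subgraph_stream(subgraph_nodes, steps_per_task=3, replay_cap=8):
        replay = []                                  # nodes from past subgraphs
        for task, nodes in enumerate(subgraph_nodes):
            for step in range(steps_per_task):
                batch = random.sample(nodes, min(4, len(nodes)))
                if replay:                           # consolidate old knowledge
                    batch += random.sample(replay, min(2, len(replay)))
                print(f"task {task}, step {step}: train on {sorted(batch)}")
            replay.extend(random.sample(nodes, min(replay_cap, len(nodes))))

    train_on_subgraph_stream([list(range(10)), list(range(10, 20))])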
List of keywords
Data Mining -> DM: Mining spatial and/or temporal data
Data Mining -> DM: Big data and scalability
Data Mining -> DM: Mining graphs
3466
Active Deep Multi-view Clustering
Helin Zhao, Wei Chen, Peng Zhou
6 min. talk | August 6th at 15:00 | Session: ML: Multi-view learning
Deep multi-view clustering has been widely studied. However, since it is an unsupervised task, where no labels are used to guide the training, it is still unreliable, especially when handling complicated data. Although deep semi-supervised multi-view clustering can alleviate this problem by using some supervised information, the supervised information is often given in advance or randomly selected. Unfortunately, the clustering performance depends heavily on the quality of the supervised information, and most semi-supervised methods ignore supervised information selection. To tackle this problem, in this paper, we propose a novel active deep multi-view clustering method, which can actively select important data for querying human annotations. In this method, we carefully design a fusion module, an active selection module, a supervised module, and an unsupervised module, and integrate them into a unified framework seamlessly. In this framework, we can obtain a more reliable clustering result with as few annotations as possible. The extensive experiments on benchmark data sets show that our method can outperform state-of-the-art unsupervised and semi-supervised methods, demonstrating the effectiveness and superiority of the proposed method. The code is available at https://github.com/wodedazhuozi/ADMC.
List of keywords
Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Active learning
Machine Learning -> ML: Multi-modal learning
3473
Decoupled Invariant Attention Network for Multivariate Time-series Forecasting
Haihua Xu, Wei Fan, Kun Yi, Pengyang Wang
6 min. talk | August 8th at 10:00 | Session: DM: Mining spatial and/or temporal data (2/2)
To achieve more accurate prediction results in Time Series Forecasting (TSF), it is essential to distinguish between the valuable patterns (invariant patterns) of the spatial-temporal relationship and the patterns that are prone to generate distribution shift (variant patterns), and then combine them for forecasting. Existing works, such as transformer-based and GNN-based models, focus on capturing the main forecasting dependencies, whether stable or not, and tend to overlook patterns that carry both useful information and distribution shift. In this paper, we propose a model for better forecasting time series: Decoupled Invariant Attention Network (DIAN), which contains two modules to learn spatial and temporal relationships respectively: 1) Spatial Decoupled Invariant-Variant Learning (SDIVL) to decouple the spatial invariant and variant attention scores, and then leverage convolutional networks to effectively integrate them for subsequent layers; 2) Temporal Augmented Invariant-Variant Learning (TAIVL) to decouple temporal invariant and variant patterns and combine them for further forecasting. In this module, we also design a Temporal Intervention Mechanism to create multiple intervened samples by reassembling variant patterns across time stamps, eliminating the spurious impacts of variant patterns. In addition, we propose Joint Optimization to minimize a loss function that considers all invariant, variant, and intervened patterns so that our model gains a more stable predictive ability. Extensive experiments on five datasets demonstrate our superior performance and higher efficiency compared with state-of-the-art methods.
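The intervention step lends itself to a short sketch. Hedged: we assume the decoupled patterns recombine additively, which is our simplification, not necessarily DIAN's:

    import torch

    def make_intervened(invariant, variant, n_interventions=4):
        # invariant, variant: (batch, time, dim) decoupled pattern tensors.
        samples = []
        for _ in range(n_interventions):
            perm = torch.randperm(variant.size(1))   # shuffle the time axis
            samples.append(invariant + variant[:, perm, :])
        return samples

    inv, var = torch.randn(8, 24, 16), torch.randn(8, 24, 16)
    intervened = make_intervened(inv, var)   # predictions should stay stable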
List of keywords
Data Mining -> DM: Mining spatial and/or temporal data
3481
DGR: A General Graph Desmoothing Framework for Recommendation via Global and Local Perspectives
Leilei Ding, Dazhong Shen, Chao Wang, Tianfu Wang, Le Zhang, Yanyong Zhang
6 min. talk | August 8th at 15:00 | Session: DM: Data Mining (2/2)
Graph Convolutional Networks (GCNs) have become pivotal in recommendation systems for learning user and item embeddings by leveraging the user-item interaction graph's node information and topology. However, these models often face the famous over-smoothing issue, leading to indistinct user and item embeddings and reduced personalization. Traditional desmoothing methods in GCN-based systems are model-specific, lacking a universal solution. This paper introduces a novel, model-agnostic approach named Desmoothing Framework for GCN-based Recommendation Systems (DGR). It effectively addresses over-smoothing in general GCN-based recommendation models by considering both global and local perspectives. Specifically, we first introduce vector perturbations during each message passing layer to penalize the tendency of node embeddings to become overly similar, guided by the global topological structure. Meanwhile, we further develop a tailor-designed loss term for the readout embeddings to preserve the local collaborative relations between users and their neighboring items. In particular, items that exhibit a high correlation with neighboring items are also incorporated to enhance the local topological information. To validate our approach, we conduct extensive experiments on 5 benchmark datasets based on 5 well-known GCN-based recommendation models, demonstrating the effectiveness and generalization of our proposed framework. Our code is available at https://github.com/me-sonandme/DGR.
List of keywords
Data Mining -> DM: Collaborative filtering
Data Mining -> DM: Recommender systems
3488
Denoising-Aware Contrastive Learning for Noisy Time Series
Shuang Zhou, Daochen Zha, Xiao Shen, Xiao Huang, Rui Zhang, Korris Chung
6 min. talk | August 7th at 11:30 | Session: ML: Self-supervised Learning
Time series self-supervised learning (SSL) aims to exploit unlabeled data for pre-training to mitigate the reliance on labels. Despite the great success in recent years, there is limited discussion on the potential noise in the time series, which can severely impair the performance of existing SSL methods. To mitigate the noise, the de facto strategy is to apply conventional denoising methods before model training. However, this pre-processing approach may not fully eliminate the effect of noise in SSL for two reasons: (i) the diverse types of noise in time series make it difficult to automatically determine suitable denoising methods; (ii) noise can be amplified after mapping raw data into latent space. In this paper, we propose denoising-aware contrastive learning (DECL), which uses contrastive learning objectives to mitigate the noise in the representation and automatically selects suitable denoising methods for every sample. Extensive experiments on various datasets verify the effectiveness of our method. The code is open-sourced.
List of keywords
Machine Learning -> ML: Self-supervised Learning
Machine Learning -> ML: Classification
Machine Learning -> ML: Representation learning
Machine Learning -> ML: Time series and data streams
3496
Improving Pseudo Labels with Global-Local Denoising Framework for Cross-lingual Named Entity Recognition
Zhuojun Ding, Wei Wei, Xiaoye Qu, Dangyang Chen
6 min. talk | August 6th at 15:00 | Session: NLP: Natural Language Processing (1/3)
Cross-lingual named entity recognition (NER) aims to train an NER model for the target language leveraging only labeled source-language data and unlabeled target-language data. Prior approaches either perform label projection on translated source-language data or employ a source model to assign pseudo labels to target-language data and train a target model on these pseudo-labeled data to generalize to the target language. However, these automatic labeling procedures inevitably introduce noisy labels, thus leading to a performance drop. In this paper, we propose a Global-Local Denoising framework (GLoDe) for cross-lingual NER. Specifically, GLoDe introduces a progressive denoising strategy to rectify incorrect pseudo labels by leveraging both global and local distribution information in the semantic space. The refined pseudo-labeled target-language data significantly improves the model's generalization ability. Moreover, previous methods only consider improving the model with language-agnostic features; however, we argue that target language-specific features are also important and should not be ignored. To this end, we employ a simple auxiliary task to achieve this goal. Experimental results on two benchmark datasets with six target languages demonstrate that our proposed GLoDe significantly outperforms current state-of-the-art methods.
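As a rough picture of local-distribution denoising, here is a simplified nearest-neighbour majority vote in feature space, our own stand-in rather than GLoDe's actual progressive strategy:

    import numpy as np

    def local_denoise(feats, pseudo, k=5):
        # Relabel each sample by the majority pseudo label among its
        # k nearest neighbours in feature space.
        d = np.linalg.norm(feats[:, None] - feats[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)                  # exclude self-matches
        nn = np.argsort(d, axis=1)[:, :k]
        refined = pseudo.copy()
        for i, neigh in enumerate(nn):
            refined[i] = np.bincount(pseudo[neigh]).argmax()
        return refined

    feats = np.random.randn(100, 32)
    labels = np.random.randint(0, 4, size=100)
    print((local_denoise(feats, labels) != labels).sum(), "labels rectified")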
List of keywords
Natural Language Processing -> NLP: Named entities
Natural Language Processing -> NLP: Information extraction
3498
BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion
Yonghao Yu, Shunan Zhu, Huai Qin, Haorui Li
6 min. talk | August 9th at 11:30 | Session: ML: Generative models
Witnessing the evolution of text-to-image diffusion models, significant strides have been made in text-to-3D generation. Currently, two primary paradigms dominate the field of text-to-3D: the feed-forward generation solutions, capable of swiftly producing 3D assets but often yielding coarse results, and the Score Distillation Sampling (SDS) based solutions, known for generating high-fidelity 3D assets albeit at a slower pace. The synergistic integration of these methods holds substantial promise for advancing 3D generation techniques. In this paper, we present BoostDream, a highly efficient plug-and-play 3D refining method designed to transform coarse 3D assets into high-quality ones. The BoostDream framework comprises three distinct processes: (1) We introduce 3D model distillation that fits differentiable representations from the 3D assets obtained through feed-forward generation. (2) A novel multi-view SDS loss is designed, which utilizes a multi-view aware 2D diffusion model to refine the 3D assets. (3) We propose to use prompt and multi-view consistent normal maps as guidance in refinement. Our extensive experiments are conducted on different differentiable 3D representations, revealing that BoostDream excels in generating high-quality 3D assets rapidly, overcoming the Janus problem compared to conventional SDS-based methods. This breakthrough signifies a substantial advancement in both the efficiency and quality of 3D generation processes.
List of keywords
Machine Learning -> ML: Generative models
Computer Vision -> CV: 3D computer vision
Multidisciplinary Topics and Applications -> MTA: Arts and creativity
3519
M2Beats: When Motion Meets Beats in Short-form Videos
Dongxiang Jiang, Yongchang Zhang, Shuai He, Anlong Ming
6 min. talk | August 8th at 15:00 | Session: CV: Image and video synthesis and generation (2/2)
In recent years, short-form videos have gained popularity, and the editing of these videos, particularly when motion is synchronized with music, is highly favored due to its beat-matching effect. However, detecting motion rhythm poses a significant challenge, as it is influenced by multiple factors that make it difficult to define using explicit rules. While traditional methods attempt to define motion rhythm, they often yield unsatisfactory results. On the other hand, learning-based methods can extract motion rhythm without relying on explicit rules but require high-quality datasets. Unfortunately, existing datasets simply substitute music rhythm for motion rhythm, although the two are not equivalent. To address these challenges, we present the motion rhythm dataset AIST-M2B, which is annotated with meticulously curated motion rhythm labels derived from the profound correlation between motion and music in professional dance. We propose a novel network architecture called M2BNet that is specifically trained on AIST-M2B to effectively extract intricate motion rhythms by incorporating both human body structure and temporal information. Additionally, we introduce a pioneering algorithm for enhancing motion rhythm synchronization with beats. Experimental results substantiate the superior performance of our method compared to other existing algorithms in the domain of motion rhythm analysis. Our code is available at https://github.com/mRobotit/M2Beats.
List of keywords
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Image and video retrieval 
Computer Vision -> CV: Motion and tracking
3521
Advancing Medical Image Segmentation via Self-supervised Instance-adaptive Prototype Learning
Guoyan Liang, Qin Zhou, Jingyuan Chen, Zhe Wang, Chang Yao
6 min. talk | August 9th at 10:00 | Session: CV: Biomedical image analysis
Medical Image Segmentation (MIS) plays a crucial role in medical therapy planning and robot navigation. Prototype learning methods in MIS focus on generating segmentation masks through pixel-to-prototype comparison. However, current approaches often overlook sample diversity by using a fixed prototype per semantic class and neglect intra-class variation within each input. In this paper, we propose to generate instance-adaptive prototypes for MIS by integrating a common prototype proposal (CPP) capturing common visual patterns and an instance-specific prototype proposal (IPP) tailored to each input. To further account for the intra-class variation, we propose to guide the IPP generation by re-weighting the intermediate feature maps according to their confidence scores. These confidence scores are hierarchically generated using a transformer decoder. Additionally, we introduce a novel self-supervised filtering strategy to prioritize the foreground pixels during the training of the transformer decoder. Extensive experiments demonstrate the favorable performance of our method.
List of keywords
Computer Vision -> CV: Biomedical image analysis
Computer Vision -> CV: Representation learning
Computer Vision -> CV: Segmentation
Machine Learning -> ML: Self-supervised Learning
3525
Spatial-Temporal Perceiving: Deciphering User Hierarchical Intent in Session-Based Recommendation
Xiao Wang, Tingting Dai, Qiao Liu, Shuang Liang
6 min. talk | August 9th at 10:00 | Session: DM: Recommender systems
Session-based recommendation (SBR) aims to predict the next-interacted item based on anonymous users’ behavior sequences. The main challenge is how to recognize the user intent with limited interactions to achieve a more accurate inference of user behavior. Existing works usually regard several consecutive items in the current session as intent. However, we argue such intent generation based on temporal transition ignores the fact that each item also has its semantically connected items in the feature space, which can be regarded as spatial intent. The limited consideration of intent fails to capture complex behavioral patterns in real-world scenarios, leading to sub-optimal solutions. To address this issue, we propose the Hierarchical Intent Perceiving Contrastive Learning Framework (HearInt) for SBR, which proposes a hierarchical consideration of intents from both temporal and spatial perspective. Specifically, we first propose that the user’s temporal intents are mutually exclusive while the spatial intents are mutually compatible. Following these analyses, we design a Temporal Intent Decoupling module to mitigate the mutual influence of long-term and short-term intents, and a Cross-scale Contrastive Learning task to enhance the consistency of intents across different spatial scales. Experimental results on three real-world datasets exhibit that HearInt achieves state-of-the-art performance.
List of keywords
Data Mining -> DM: Recommender systems
3528
Towards Robust Trajectory Representations: Isolating Environmental Confounders with Causal Learning
Kang Luo, Yuanshao Zhu, Wei Chen, Kun Wang, Zhengyang Zhou, Sijie Ruan, Yuxuan Liang
12 min. talk | August 6th at 15:00 | Session: MTA: Transportation
Trajectory modeling refers to characterizing human movement behavior, serving as a pivotal step in understanding mobility patterns. Nevertheless, existing studies typically ignore the confounding effects of geospatial context, leading to the acquisition of spurious correlations and limited generalization capabilities. To bridge this gap, we initially formulate a Structural Causal Model (SCM) to decipher the trajectory representation learning process from a causal perspective. Building upon the SCM, we further present a Trajectory modeling framework (TrajCL) based on Causal Learning, which leverages the backdoor adjustment theory as an intervention tool to eliminate the spurious correlations between geospatial context and trajectories. Extensive experiments on two real-world datasets verify that TrajCL markedly enhances performance in trajectory classification tasks while showcasing superior generalization and interpretability.
List of keywords
Data Mining -> DM: Mining spatial and/or temporal data
Multidisciplinary Topics and Applications -> MTA: Transportation
3538
Practical Hybrid Gradient Compression for Federated Learning Systems
Sixu Hu, Linshan Jiang, Bingsheng He
6 min. talk | August 6th at 15:00 | Session: ML: Federated learning (1/2)
The high communication cost is a major challenge in the federated learning (FL) training process. Several methods have been proposed to reduce communication costs on the uplink channel, primarily sparsification-based methods, which have overlooked the impact of downlink channels. However, model accuracy and communication cost issues arise when applying them in practical FL applications, especially when the bandwidth is limited both on the uplink and downlink channels. In this paper, we propose a novel secure-FL-compatible hybrid gradient compression framework (HGC) that handles both uplink and downlink communication. Specifically, HGC identifies and exploits three types of redundancies in the FL training process. With proposed optimization methods based on compression ratio correction and dynamic momentum correction, HGC improves the trade-off between communication cost and model performance. The extensive theoretical and empirical analysis demonstrates the effectiveness of our framework in achieving a high compression ratio for both uplink and downlink communications with negligible loss of model accuracy, surpassing the state-of-the-art compression methods.
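For background, the standard sparsification building block that such frameworks refine is top-k compression with error feedback; a generic sketch (not HGC's exact scheme):

    import numpy as np

    def compress_with_feedback(grad, residual, ratio=0.01):
        # Add back previously dropped mass, then keep the k largest entries.
        corrected = grad + residual
        k = max(1, int(corrected.size * ratio))
        idx = np.argpartition(np.abs(corrected), -k)[-k:]
        sparse = np.zeros_like(corrected)
        sparse[idx] = corrected[idx]
        return idx, corrected[idx], corrected - sparse   # new residual

    g = np.random.randn(10_000)
    residual = np.zeros_like(g)
    idx, vals, residual = compress_with_feedback(g, residual)
    print(f"sent {vals.size} of {g.size} values upstream")

Carrying the residual forward means dropped gradient mass is eventually transmitted, which is what keeps such schemes from losing accuracy at high compression ratios.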
List of keywords
Machine Learning -> ML: Federated learning
3549
Make Graph Neural Networks Great Again: A Generic Integration Paradigm of Topology-Free Patterns for Traffic Speed Prediction
Yicheng Zhou, Pengfei Wang, Hao Dong, Denghui Zhang, Dingqi Yang, Yanjie Fu, Pengyang Wang
6 min. talk | August 8th at 10:00 | Session: DM: Mining spatial and/or temporal data (2/2)
Urban traffic speed prediction aims to estimate the future traffic speed for improving urban transportation services. Enormous efforts have been made to exploit Graph Neural Networks (GNNs) for modeling spatial correlations and temporal dependencies of traffic speed evolving patterns, regularized by graph topology. While achieving promising results, current traffic speed prediction methods still suffer from ignoring topology-free patterns, which cannot be captured by GNNs. To tackle this challenge, we propose a generic model for enabling the current GNN-based methods to preserve topology-free patterns. Specifically, we first develop a Dual Cross-Scale Transformer (DCST) architecture, including a Spatial Transformer and a Temporal Transformer, to preserve the cross-scale topology-free patterns and associated dynamics, respectively. Then, to further integrate both topology-regularized/-free patterns, we propose a distillation-style learning framework, in which the existing GNN-based methods are considered as the teacher model, and the proposed DCST architecture is considered as the student model. The teacher model would inject the learned topology-regularized patterns into the student model for integrating topology-free patterns. The extensive experimental results demonstrated the effectiveness of our methods.
List of keywords
Data Mining -> DM: Mining spatial and/or temporal data
3577
AK4Prompts: Aesthetics-driven Automatically Keywords-Ranking for Prompts in Text-To-Image Models
Haiyang Zhang, Mengchao Wang, Shuai He, Anlong Ming
6 min. talk | August 8th at 11:30 | Session: CV: Image and video synthesis and generation (1/2)
Current text-to-image synthesis (TIS) models have demonstrated the ability to generate high-fidelity images based on textual prompts. However, the efficacy of these models heavily relies on the keywords present in the prompts, and there is a dearth of objective analysis regarding how different keywords impact the ultimate quality of generated results. Therefore, manual evaluation becomes necessary, yet it is limited and inefficient for ascertaining the role played by keywords. In this paper, we propose automated keywords-ranking for prompts (AK4Prompts), a keyword evaluation model based on mainstream TIS models that explicitly quantifies the multidimensional impact of various keywords on image generation based on prompts. To enable personalized keyword evaluation based on prompt content, we propose decoupling the latent representations of keywords and prompts in TIS models, followed by integrating the semantic features of prompts into keywords. For quantitative and multidimensional evaluation, we align the fused features of keywords using HPSv2, aesthetic score, and CLIP score, each representing distinct factors contributing to keyword impact. Our AK4Prompts can flexibly and automatically select the keywords that best match the original prompt based on individual user preferences. Extensive experimental results show the superiority of AK4Prompts in significantly improving the quality of generated images over strong baselines. Our approach not only enhances usability and user experience but also addresses the current gap in automated analysis and evaluation of keyword effects. Our code is available at https://github.com/mRobotit/AK4Prompts.
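The final ranking step reduces to a preference-weighted combination of per-keyword scores. A hypothetical sketch with fabricated numbers (the real model derives these from HPSv2, aesthetic, and CLIP evaluations):

    def rank_keywords(scores, weights):
        # scores: {keyword: (hps, aesthetic, clip)}; weights: same-order triple.
        combined = {k: sum(w * s for w, s in zip(weights, v))
                    for k, v in scores.items()}
        return sorted(combined, key=combined.get, reverse=True)

    toy = {"masterpiece": (0.8, 0.9, 0.6),
           "4k": (0.5, 0.7, 0.4),
           "oil painting": (0.6, 0.8, 0.7)}
    print(rank_keywords(toy, weights=(0.5, 0.3, 0.2)))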
List of keywords
Computer Vision -> CV: Image and video synthesis and generation 
Computer Vision -> CV: Computational photography
Computer Vision -> CV: Machine learning for vision
3578
Label-efficient Semantic Scene Completion with Scribble Annotations
Song Wang, Jiawei Yu, Wentong Li, Hao Shi, Kailun Yang, Junbo Chen, Jianke Zhu
6 min. talk | August 9th at 11:30 | Session: CV: Computer Vision (1/2)
Semantic scene completion aims to infer 3D geometric structures with semantic classes from camera or LiDAR data, providing essential occupancy information for autonomous driving. Prior endeavors concentrate on constructing networks or benchmarks in a fully supervised manner, while the dense occupancy grids need point-wise semantic annotations, which incur expensive and tedious labeling costs. In this paper, we build a new label-efficient benchmark, named ScribbleSC, where sparse scribble-based semantic labels are combined with dense geometric labels for semantic scene completion. In particular, we propose a simple yet effective approach called Scribble2Scene, which bridges the gap between sparse scribble annotations and full supervision. Our method consists of geometric-aware auto-labeler construction and online model training with an offline-to-online distillation module to enhance performance. Experiments on SemanticKITTI demonstrate that Scribble2Scene achieves competitive performance against fully-supervised counterparts, reaching 99% of the performance of fully-supervised models with only 13.5% of voxels labeled. Both the annotations of ScribbleSC and our full implementation are available at https://github.com/songw-zju/Scribble2Scene.
List of keywords
Computer Vision -> CV: Scene analysis and understanding   
Computer Vision -> CV: Applications
3582
Generalized Taxonomy-Guided Graph Neural Networks
Yu Zhou, Di Jin, Jianguo Wei, Dongxiao He, Zhizhi Yu, Weixiong Zhang
6 min. talk | August 6th at 15:00 | Session: DM: Mining graphs (2/3)
Graph neural networks have been demonstrated to be an effective analytic apparatus for mining network data. Most real-world networks are inherently hierarchical, offering unique opportunities to acquire latent, intrinsic network organizational properties by utilizing network taxonomies. The existing approaches for learning implicit hierarchical network structures focus on introducing taxonomy into graph neural networks but often fall short of exploiting the rich network semantics and structural properties in the taxonomy, resulting in poor generalizability and reusability. To address these issues, we propose generalized Taxonomy-Guided Graph Neural Networks (TG-GNN) to integrate taxonomy into network representation learning. We first construct a taxonomy representation learning module that introduces the concept of the ego network to propagate and aggregate rich semantic and structural information in the taxonomy. We then design a taxonomy-guided Markov mechanism, which encapsulates taxonomy knowledge in pairwise potential functions, to refine network embeddings. Extensive experiments on various real-world networks illustrate the effectiveness of TG-GNN over the state-of-the-art methods in scenarios involving incomplete taxonomies and inductive settings.
List of keywords
Data Mining -> DM: Mining graphs
Machine Learning -> ML: Sequence and graph learning
3586
Individual Causal Structure Learning from Population Data
Wei Chen, Xiaokai Huang, Zijian Li, Ruichu Cai, Zhiyi Huang, Zhifeng Hao
6 min. talk | August 7th at 10:00 | Session: UAI: Causality, structural causal models and causal inference
Learning the causal structure of each individual plays a crucial role in fields such as neuroscience and biology. Existing methods consider data from each individual separately, which may yield inaccurate causal structure estimations in limited samples. To leverage more samples, we consider incorporating data from all individuals as population data. We observe that the variables of all individuals are influenced by the common environment variables they share. These shared environment variables can be modeled as latent variables and serve as a bridge connecting data from different individuals. In particular, we propose an Individual Linear Acyclic Model (ILAM) for each individual from population data, which models each individual's variables as being linearly influenced by their parents, in addition to environment variables and noise terms. Theoretical analysis shows that the model is identifiable when all environment variables are non-Gaussian, or even if some are Gaussian, provided there is adequate diversity in the noise variances across individuals. We then develop an individual causal structure learning method based on the Share Independence Component Analysis technique. Experimental results on synthetic and real-world data demonstrate the correctness of the method even when the sample size of each individual's data is small.
List of keywords
Uncertainty in AI -> UAI: Causality, structural causal models and causal inference
3589
Robust Contrastive Multi-view Kernel Clustering
Peng Su, Yixi Liu, Shujian Li, Shudong Huang, Jiancheng Lv
6 min. talk | August 6th at 15:00 | Session: ML: Multi-view learning
Multi-view kernel clustering (MKC) aims to fully reveal the consistency and complementarity of multiple views in a potential Hilbert space, thereby enhancing clustering performance. The clustering results of most MKC methods are highly sensitive to the quality of the constructed kernels, as traditional methods independently compute kernel matrices for each view without fully considering complementary information across views. In previous contrastive multi-view kernel learning, the goal was to bring cross-view instances of the same sample closer during the kernel construction process while pushing apart instances across samples to achieve a comprehensive integration of cross-view information. However, its inherent drawback is the potential inappropriate amplification of distances between different instances of the same clusters (i.e., false negative pairs) during the training process, leading to a reduction in inter-class discriminability. To address this challenge, we propose a Robust Contrastive multi-view kernel Learning approach (R-CMK) against false negative pairs. It partitions negative pairs into different intervals based on distance or similarity, and for false negative pairs, reverses their optimization gradient. This effectively avoids further amplification of distances for false negative pairs while simultaneously pushing true negative pairs farther apart. We conducted comprehensive experiments on various MKC methods to validate the effectiveness of the proposed method. The code is available at https://github.com/Duo-laimi/rcmk_main.
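A minimal sketch of the reversed-gradient treatment of suspected false negatives follows; this is our own toy loss with an illustrative similarity threshold, not R-CMK's full interval-based scheme:

    import torch

    def robust_negative_loss(sim_neg, threshold=0.8):
        # sim_neg: similarities of negative pairs; above the threshold we
        # suspect a false negative and reverse the push-apart gradient.
        suspected_fn = sim_neg > threshold
        push = sim_neg[~suspected_fn].sum()   # minimize -> repel true negatives
        pull = sim_neg[suspected_fn].sum()    # reversed sign -> attract instead
        return push - pull

    sim = torch.rand(32, requires_grad=True)
    robust_negative_loss(sim).backward()

Minimizing this loss drives ordinary negative pairs apart while pulling highly similar (likely same-cluster) pairs together, which is the gradient-reversal effect described above.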
List of keywords
Machine Learning -> ML: Multi-view learning
Machine Learning -> ML: Clustering
Machine Learning -> ML: Kernel methods
3593
Regression Residual Reasoning with Pseudo-labeled Contrastive Learning for Uncovering Multiple Complex Compositional Relations
Chengtai Li, Yuting He, Jianfeng Ren, Ruibin Bai, Yitian Zhao, Heng Yu, Xudong Jiang
6 min. talk | August 8th at 10:00 | Session: KRR: Learning and reasoning
Abstract Visual Reasoning (AVR) has been widely studied in literature. Our study reveals that AVR models tend to rely on appearance matching rather than a genuine understanding of underlying rules. We hence develop a challenging benchmark, Multiple Complex Compositional Reasoning (MC2R), composed of diverse compositional rules on attributes with intentionally increased variations. It aims to identify two outliers from five given images, in contrast to single-answer questions in previous AVR tasks. To solve MC2R tasks, a Regression Residual Reasoning with Pseudo-labeled Contrastive Learning (R3PCL) is proposed, which first transforms the original problem by selecting three images following the same rule, and iteratively regresses one normal image by using the other two, allowing the model to gradually comprehend the underlying rules. The proposed PCL leverages a set of min-max operations to generate more reliable pseudo labels, and exploits contrastive learning with data augmentation on pseudo-labeled images to boost the discrimination and generalization of features. Experimental results on two AVR datasets show that the proposed R3PCL significantly outperforms state-of-the-art models.
List of keywords
Knowledge Representation and Reasoning -> KRR: Learning and reasoning
3597
ScreenAgent: A Vision Language Model-driven Computer Control Agent
Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, Qi Wang
6 min. talk | August 6th at 11:30 | Session: NLP: Dialogue and interactive systems
Large Language Models (LLMs) can invoke a variety of tools and APIs to complete complex tasks. The computer, as the most powerful and universal tool, could potentially be controlled by a trained LLM agent. Powered by the computer, we can hopefully build a more generalized agent to assist humans in various daily digital works. In this paper, we construct an environment for a Vision Language Model (VLM) agent to interact with a real computer screen. Within this environment, the agent can observe screenshots and manipulate the Graphical User Interface (GUI) by outputting mouse and keyboard actions. We also design an automated control pipeline that includes planning, acting, and reflecting phases, guiding the agent to continuously interact with the environment and complete multi-step tasks. Additionally, we construct the ScreenAgent Dataset, which collects screenshots and action sequences when completing daily computer tasks. Finally, we train a model, ScreenAgent, which achieves computer control capabilities comparable to GPT-4V and demonstrates more precise UI positioning capabilities. Our attempts could inspire further research on building a generalist LLM agent. The code and more detailed information are at https://github.com/niuzaisheng/ScreenAgent.
List of keywords
Natural Language Processing -> NLP: Dialogue and interactive systems
Agent-based and Multi-agent Systems -> MAS: Human-agent interaction
Computer Vision -> CV: Vision, language and reasoning
Natural Language Processing -> NLP: Resources and evaluation
3600
TFLOP: Table Structure Recognition Framework with Layout Pointer Mechanism
Minsoo Khang, Teakgyu Hong
6 min. talk | August 7th at 15:00 | Session: CV: Recognition (object detection, categorization)
Table Structure Recognition (TSR) is a task aimed at converting table images into a machine-readable format (e.g., HTML) to facilitate other applications such as information retrieval. Recent works tackle this problem by identifying the HTML tags and text regions, where the latter are used for text extraction from the table document. These works, however, suffer from misalignment issues when mapping text into the identified text regions. In this paper, we introduce a new TSR framework, called TFLOP (TSR Framework with LayOut Pointer mechanism), which reformulates the conventional text region prediction and matching into a direct text region pointing problem. Specifically, TFLOP utilizes text region information to identify both the table's structure tags and its aligned text regions simultaneously. Without the need for region prediction and alignment, TFLOP circumvents the additional text region matching stage, which requires finely-calibrated post-processing. TFLOP also employs span-aware contrastive supervision to enhance the pointing mechanism in tables with complex structure. As a result, TFLOP achieves state-of-the-art performance across multiple benchmarks such as PubTabNet, FinTabNet, and SynthTabNet. In our extensive experiments, TFLOP not only exhibits competitive performance but also shows promising results in industrial document TSR scenarios, such as documents with watermarks or in non-English domains. The source code of our work is publicly available at: https://github.com/UpstageAI/TFLOP.
List of keywords
Computer Vision -> CV: Recognition (object detection, categorization)
Computer Vision -> CV: Applications
Natural Language Processing -> NLP: Applications
3601
A Bias-Free Revenue-Maximizing Bidding Strategy for Data Consumers in Auction-based Federated Learning
Xiaoli Tang, Han Yu, Zengxiang Li, Xiaoxiao Li
6 min. talk | August 6th at 15:00 | Session: ML: Federated learning (1/2)
Auction-based Federated Learning (AFL) is a burgeoning research area. However, existing bidding strategies for AFL data consumers (DCs) primarily focus on maximizing expected accumulated utility, disregarding the more complex goal of revenue maximization. They also only consider winning bids, leading to biased estimates by overlooking information from losing bids. To address these issues, we propose a Bias-free Revenue-maximizing Federated bidding strategy for DCs in AFL (BR-FEDBIDDER). Our theoretical exploration of the relationships between Return on Investment (ROI), bid costs, and utility, and their impact on overall revenue underscores the complexity of maximizing revenue solely by prioritizing ROI enhancement. Leveraging these insights, BR-FEDBIDDER optimizes bid costs with any given ROI constraint. In addition, we incorporate an auxiliary task of winning probability estimation into the framework to achieve bias-free learning by leveraging bid records from historical bid requests, including both winning and losing ones. Extensive experiments on six widely used benchmark datasets show that BR-FEDBIDDER outperforms eight state-of-the-art methods, surpassing the best-performing baseline by 5.66%, 6.08% and 2.44% in terms of the total revenue, ROI, and test accuracy of the resulting FL models, respectively.
List of keywords
Machine Learning -> ML: Federated learning
3643
A Neural Column Generation Approach to the Vehicle Routing Problem with Two-Dimensional Loading and Last-In-First-Out Constraints
Yifan Xia, Xiangyi Zhang
6 min. talk | August 7th at 10:00 | Session: CSO: Constraint optimization problems
The vehicle routing problem with two-dimensional loading constraints (2L-CVRP) and the last-in-first-out (LIFO) rule presents significant practical and algorithmic challenges. While numerous heuristic approaches have been proposed to address its complexity, which stems from two NP-hard problems, the vehicle routing problem (VRP) and the two-dimensional bin packing problem (2D-BPP), less attention has been paid to developing exact algorithms. Bridging this gap, this article presents an exact algorithm that integrates advanced machine learning techniques, specifically a novel combination of attention and recurrence mechanisms. This integration accelerates the state-of-the-art exact algorithm by a median of 29.79% across various problem instances. Moreover, the proposed algorithm successfully resolves an open instance in the standard test-bed, demonstrating the significant improvements brought about by the incorporation of machine learning models. Code is available at https://github.com/xyfffff/NCG-for-2L-CVRP.
List of keywords
Constraint Satisfaction and Optimization -> CSO: Constraint optimization problems
Constraint Satisfaction and Optimization -> CSO: Modeling
Machine Learning -> ML: Applications
Multidisciplinary Topics and Applications -> MTA: Transportation
3654
KDDC: Knowledge-Driven Disentangled Causal Metric Learning for Pre-Travel Out-of-Town Recommendation
Yinghui Liu, Guojiang Shen, Chengyong Cui, Zhenzhen Zhao, Xiao Han, Jiaxin Du, Xiangyu Zhao, Xiangjie Kong
6 min. talk | August 9th at 10:00 | Session: DM: Recommender systems
Pre-travel recommendation is developed to provide a variety of out-