-
Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning
Authors:
Mustafa Dogan,
Ilker Kesen,
Iacer Calixto,
Aykut Erdem,
Erkut Erdem
Abstract:
The linguistic capabilities of Multimodal Large Language Models (MLLMs) are critical for their effective application across diverse tasks. This study aims to evaluate the performance of MLLMs on the VALSE benchmark, focusing on the efficacy of few-shot In-Context Learning (ICL), and Chain-of-Thought (CoT) prompting. We conducted a comprehensive assessment of state-of-the-art MLLMs, varying in mode…
▽ More
The linguistic capabilities of Multimodal Large Language Models (MLLMs) are critical for their effective application across diverse tasks. This study aims to evaluate the performance of MLLMs on the VALSE benchmark, focusing on the efficacy of few-shot In-Context Learning (ICL), and Chain-of-Thought (CoT) prompting. We conducted a comprehensive assessment of state-of-the-art MLLMs, varying in model size and pretraining datasets. The experimental results reveal that ICL and CoT prompting significantly boost model performance, particularly in tasks requiring complex reasoning and contextual understanding. Models pretrained on captioning datasets show superior zero-shot performance, while those trained on interleaved image-text data benefit from few-shot learning. Our findings provide valuable insights into optimizing MLLMs for better grounding of language in visual contexts, highlighting the importance of the composition of pretraining data and the potential of few-shot learning strategies to improve the reasoning abilities of MLLMs.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models
Authors:
Yigit Ekin,
Ahmet Burak Yildirim,
Erdem Eren Caglar,
Aykut Erdem,
Erkut Erdem,
Aysegul Dundar
Abstract:
Advanced image editing techniques, particularly inpainting, are essential for seamlessly removing unwanted elements while preserving visual integrity. Traditional GAN-based methods have achieved notable success, but recent advancements in diffusion models have produced superior results due to their training on large-scale datasets, enabling the generation of remarkably realistic inpainted images.…
▽ More
Advanced image editing techniques, particularly inpainting, are essential for seamlessly removing unwanted elements while preserving visual integrity. Traditional GAN-based methods have achieved notable success, but recent advancements in diffusion models have produced superior results due to their training on large-scale datasets, enabling the generation of remarkably realistic inpainted images. Despite their strengths, diffusion models often struggle with object removal tasks without explicit guidance, leading to unintended hallucinations of the removed object. To address this issue, we introduce CLIPAway, a novel approach leveraging CLIP embeddings to focus on background regions while excluding foreground elements. CLIPAway enhances inpainting accuracy and quality by identifying embeddings that prioritize the background, thus achieving seamless object removal. Unlike other methods that rely on specialized training datasets or costly manual annotations, CLIPAway provides a flexible, plug-and-play solution compatible with various diffusion-based inpainting techniques.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models
Authors:
Burak Can Biner,
Farrin Marouf Sofian,
Umur Berkay Karakaş,
Duygu Ceylan,
Erkut Erdem,
Aykut Erdem
Abstract:
We are witnessing a revolution in conditional image synthesis with the recent success of large scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using multi-modal input. While spatial control using cues such as depth, sketch, and other images has attracted a lot of research, we argue that another equally effective…
▽ More
We are witnessing a revolution in conditional image synthesis with the recent success of large scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using multi-modal input. While spatial control using cues such as depth, sketch, and other images has attracted a lot of research, we argue that another equally effective modality is audio since sound and sight are two main components of human perception. Hence, we propose a method to enable audio-conditioning in large scale image diffusion models. Our method first maps features obtained from audio clips to tokens that can be injected into the diffusion model in a fashion similar to text tokens. We introduce additional audio-image cross attention layers which we finetune while freezing the weights of the original layers of the diffusion model. In addition to audio conditioned image generation, our method can also be utilized in conjuction with diffusion based editing methods to enable audio conditioned image editing. We demonstrate our method on a wide range of audio and image datasets. We perform extensive comparisons with recent methods and show favorable performance.
△ Less
Submitted 1 May, 2024;
originally announced May 2024.
-
Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare
Authors:
Emre Can Acikgoz,
Osman Batur İnce,
Rayene Bench,
Arda Anıl Boz,
İlker Kesen,
Aykut Erdem,
Erkut Erdem
Abstract:
The integration of Large Language Models (LLMs) into healthcare promises to transform medical diagnostics, research, and patient care. Yet, the progression of medical LLMs faces obstacles such as complex training requirements, rigorous evaluation demands, and the dominance of proprietary models that restrict academic exploration. Transparent, comprehensive access to LLM resources is essential for…
▽ More
The integration of Large Language Models (LLMs) into healthcare promises to transform medical diagnostics, research, and patient care. Yet, the progression of medical LLMs faces obstacles such as complex training requirements, rigorous evaluation demands, and the dominance of proprietary models that restrict academic exploration. Transparent, comprehensive access to LLM resources is essential for advancing the field, fostering reproducibility, and encouraging innovation in healthcare AI. We present Hippocrates, an open-source LLM framework specifically developed for the medical domain. In stark contrast to previous efforts, it offers unrestricted access to its training datasets, codebase, checkpoints, and evaluation protocols. This open approach is designed to stimulate collaborative research, allowing the community to build upon, refine, and rigorously evaluate medical LLMs within a transparent ecosystem. Also, we introduce Hippo, a family of 7B models tailored for the medical domain, fine-tuned from Mistral and LLaMA2 through continual pre-training, instruction tuning, and reinforcement learning from human and AI feedback. Our models outperform existing open medical LLMs models by a large-margin, even surpassing models with 70B parameters. Through Hippocrates, we aspire to unlock the full potential of LLMs not just to advance medical knowledge and patient care but also to democratize the benefits of AI research in healthcare, making them available across the globe.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
Sequential Compositional Generalization in Multimodal Models
Authors:
Semih Yagcioglu,
Osman Batur İnce,
Aykut Erdem,
Erkut Erdem,
Desmond Elliott,
Deniz Yuret
Abstract:
The rise of large-scale multimodal models has paved the pathway for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications in a variety of complex tasks. However, a pressing question that remains is their genuine capability for stronger forms of generalization, which has been largely underexplored in the multimodal setting. Our study aims to address thi…
▽ More
The rise of large-scale multimodal models has paved the pathway for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications in a variety of complex tasks. However, a pressing question that remains is their genuine capability for stronger forms of generalization, which has been largely underexplored in the multimodal setting. Our study aims to address this by examining sequential compositional generalization using \textsc{CompAct} (\underline{Comp}ositional \underline{Act}ivities)\footnote{Project Page: \url{http://cyberiada.github.io/CompAct}}, a carefully constructed, perceptually grounded dataset set within a rich backdrop of egocentric kitchen activity videos. Each instance in our dataset is represented with a combination of raw video footage, naturally occurring sound, and crowd-sourced step-by-step descriptions. More importantly, our setup ensures that the individual concepts are consistently distributed across training and evaluation sets, while their compositions are novel in the evaluation set. We conduct a comprehensive assessment of several unimodal and multimodal models. Our findings reveal that bi-modal and tri-modal models exhibit a clear edge over their text-only counterparts. This highlights the importance of multimodality while charting a trajectory for future research in this domain.
△ Less
Submitted 18 April, 2024;
originally announced April 2024.
-
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
Authors:
Ilker Kesen,
Andrea Pedrotti,
Mustafa Dogan,
Michele Cafagna,
Emre Can Acikgoz,
Letitia Parcalabescu,
Iacer Calixto,
Anette Frank,
Albert Gatt,
Aykut Erdem,
Erkut Erdem
Abstract:
With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm foo…
▽ More
With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential to solving the main counterfactual tests. We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, helping to highlight areas that still need to be explored.
△ Less
Submitted 12 November, 2023;
originally announced November 2023.
-
Harnessing Dataset Cartography for Improved Compositional Generalization in Transformers
Authors:
Osman Batur İnce,
Tanin Zeraati,
Semih Yagcioglu,
Yadollah Yaghoobzadeh,
Erkut Erdem,
Aykut Erdem
Abstract:
Neural networks have revolutionized language modeling and excelled in various downstream tasks. However, the extent to which these models achieve compositional generalization comparable to human cognitive abilities remains a topic of debate. While existing approaches in the field have mainly focused on novel architectures and alternative learning paradigms, we introduce a pioneering method harness…
▽ More
Neural networks have revolutionized language modeling and excelled in various downstream tasks. However, the extent to which these models achieve compositional generalization comparable to human cognitive abilities remains a topic of debate. While existing approaches in the field have mainly focused on novel architectures and alternative learning paradigms, we introduce a pioneering method harnessing the power of dataset cartography (Swayamdipta et al., 2020). By strategically identifying a subset of compositional generalization data using this approach, we achieve a remarkable improvement in model accuracy, yielding enhancements of up to 10% on CFQ and COGS datasets. Notably, our technique incorporates dataset cartography as a curriculum learning criterion, eliminating the need for hyperparameter tuning while consistently achieving superior performance. Our findings highlight the untapped potential of dataset cartography in unleashing the full capabilities of compositional generalization within Transformer models. Our code is available at https://github.com/cyberiada/cartography-for-compositionality.
△ Less
Submitted 18 October, 2023;
originally announced October 2023.
-
Hyperspectral Image Denoising via Self-Modulating Convolutional Neural Networks
Authors:
Orhan Torun,
Seniha Esen Yuksel,
Erkut Erdem,
Nevrez Imamoglu,
Aykut Erdem
Abstract:
Compared to natural images, hyperspectral images (HSIs) consist of a large number of bands, with each band capturing different spectral information from a certain wavelength, even some beyond the visible spectrum. These characteristics of HSIs make them highly effective for remote sensing applications. That said, the existing hyperspectral imaging devices introduce severe degradation in HSIs. Henc…
▽ More
Compared to natural images, hyperspectral images (HSIs) consist of a large number of bands, with each band capturing different spectral information from a certain wavelength, even some beyond the visible spectrum. These characteristics of HSIs make them highly effective for remote sensing applications. That said, the existing hyperspectral imaging devices introduce severe degradation in HSIs. Hence, hyperspectral image denoising has attracted lots of attention by the community lately. While recent deep HSI denoising methods have provided effective solutions, their performance under real-life complex noise remains suboptimal, as they lack adaptability to new data. To overcome these limitations, in our work, we introduce a self-modulating convolutional neural network which we refer to as SM-CNN, which utilizes correlated spectral and spatial information. At the core of the model lies a novel block, which we call spectral self-modulating residual block (SSMRB), that allows the network to transform the features in an adaptive manner based on the adjacent spectral data, enhancing the network's ability to handle complex noise. In particular, the introduction of SSMRB transforms our denoising network into a dynamic network that adapts its predicted features while denoising every input HSI with respect to its spatio-spectral characteristics. Experimental analysis on both synthetic and real data shows that the proposed SM-CNN outperforms other state-of-the-art HSI denoising methods both quantitatively and qualitatively on public benchmark datasets.
△ Less
Submitted 15 September, 2023;
originally announced September 2023.
-
Spherical Vision Transformer for 360-degree Video Saliency Prediction
Authors:
Mert Cokelek,
Nevrez Imamoglu,
Cagri Ozcinar,
Erkut Erdem,
Aykut Erdem
Abstract:
The growing interest in omnidirectional videos (ODVs) that capture the full field-of-view (FOV) has gained 360-degree saliency prediction importance in computer vision. However, predicting where humans look in 360-degree scenes presents unique challenges, including spherical distortion, high resolution, and limited labelled data. We propose a novel vision-transformer-based model for omnidirectiona…
▽ More
The growing interest in omnidirectional videos (ODVs) that capture the full field-of-view (FOV) has gained 360-degree saliency prediction importance in computer vision. However, predicting where humans look in 360-degree scenes presents unique challenges, including spherical distortion, high resolution, and limited labelled data. We propose a novel vision-transformer-based model for omnidirectional videos named SalViT360 that leverages tangent image representations. We introduce a spherical geometry-aware spatiotemporal self-attention mechanism that is capable of effective omnidirectional video understanding. Furthermore, we present a consistency-based unsupervised regularization term for projection-based 360-degree dense-prediction models to reduce artefacts in the predictions that occur after inverse projection. Our approach is the first to employ tangent images for omnidirectional saliency prediction, and our experimental results on three ODV saliency datasets demonstrate its effectiveness compared to the state-of-the-art.
△ Less
Submitted 24 August, 2023;
originally announced August 2023.
-
CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing
Authors:
Ahmet Canberk Baykal,
Abdul Basit Anees,
Duygu Ceylan,
Erkut Erdem,
Aykut Erdem,
Deniz Yuret
Abstract:
Researchers have recently begun exploring the use of StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing approaches for editing images using language either resort to instance-level latent code optimization or map predefined text prompts to some editing directions in the latent space. H…
▽ More
Researchers have recently begun exploring the use of StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing approaches for editing images using language either resort to instance-level latent code optimization or map predefined text prompts to some editing directions in the latent space. However, these approaches have inherent limitations. The former is not very efficient, while the latter often struggles to effectively handle multi-attribute changes. To address these weaknesses, we present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes. The core of our method is the use of novel, lightweight text-conditioned adapter layers integrated into pretrained GAN-inversion networks. We demonstrate that by conditioning the initial inversion step on the CLIP embedding of the target description, we are able to obtain more successful edit directions. Additionally, we use a CLIP-guided refinement step to make corrections in the resulting residual latent codes, which further improves the alignment with the text prompt. Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds, as shown by our qualitative and quantitative results.
△ Less
Submitted 18 July, 2023; v1 submitted 17 July, 2023;
originally announced July 2023.
-
HyperE2VID: Improving Event-Based Video Reconstruction via Hypernetworks
Authors:
Burak Ercan,
Onur Eker,
Canberk Saglam,
Aykut Erdem,
Erkut Erdem
Abstract:
Event-based cameras are becoming increasingly popular for their ability to capture high-speed motion with low latency and high dynamic range. However, generating videos from events remains challenging due to the highly sparse and varying nature of event data. To address this, in this study, we propose HyperE2VID, a dynamic neural network architecture for event-based video reconstruction. Our appro…
▽ More
Event-based cameras are becoming increasingly popular for their ability to capture high-speed motion with low latency and high dynamic range. However, generating videos from events remains challenging due to the highly sparse and varying nature of event data. To address this, in this study, we propose HyperE2VID, a dynamic neural network architecture for event-based video reconstruction. Our approach uses hypernetworks to generate per-pixel adaptive filters guided by a context fusion module that combines information from event voxel grids and previously reconstructed intensity images. We also employ a curriculum learning strategy to train the network more robustly. Our comprehensive experimental evaluations across various benchmark datasets reveal that HyperE2VID not only surpasses current state-of-the-art methods in terms of reconstruction quality but also achieves this with fewer parameters, reduced computational requirements, and accelerated inference times.
△ Less
Submitted 20 February, 2024; v1 submitted 10 May, 2023;
originally announced May 2023.
-
EVREAL: Towards a Comprehensive Benchmark and Analysis Suite for Event-based Video Reconstruction
Authors:
Burak Ercan,
Onur Eker,
Aykut Erdem,
Erkut Erdem
Abstract:
Event cameras are a new type of vision sensor that incorporates asynchronous and independent pixels, offering advantages over traditional frame-based cameras such as high dynamic range and minimal motion blur. However, their output is not easily understandable by humans, making the reconstruction of intensity images from event streams a fundamental task in event-based vision. While recent deep lea…
▽ More
Event cameras are a new type of vision sensor that incorporates asynchronous and independent pixels, offering advantages over traditional frame-based cameras such as high dynamic range and minimal motion blur. However, their output is not easily understandable by humans, making the reconstruction of intensity images from event streams a fundamental task in event-based vision. While recent deep learning-based methods have shown promise in video reconstruction from events, this problem is not completely solved yet. To facilitate comparison between different approaches, standardized evaluation protocols and diverse test datasets are essential. This paper proposes a unified evaluation methodology and introduces an open-source framework called EVREAL to comprehensively benchmark and analyze various event-based video reconstruction methods from the literature. Using EVREAL, we give a detailed analysis of the state-of-the-art methods for event-based video reconstruction, and provide valuable insights into the performance of these methods under varying settings, challenging scenarios, and downstream tasks.
△ Less
Submitted 5 April, 2024; v1 submitted 30 April, 2023;
originally announced May 2023.
-
VidStyleODE: Disentangled Video Editing via StyleGAN and NeuralODEs
Authors:
Moayed Haji Ali,
Andrew Bond,
Tolga Birdal,
Duygu Ceylan,
Levent Karacan,
Erkut Erdem,
Aykut Erdem
Abstract:
We propose $\textbf{VidStyleODE}$, a spatiotemporally continuous disentangled $\textbf{Vid}$eo representation based upon $\textbf{Style}$GAN and Neural-$\textbf{ODE}$s. Effective traversal of the latent space learned by Generative Adversarial Networks (GANs) has been the basis for recent breakthroughs in image editing. However, the applicability of such advancements to the video domain has been hi…
▽ More
We propose $\textbf{VidStyleODE}$, a spatiotemporally continuous disentangled $\textbf{Vid}$eo representation based upon $\textbf{Style}$GAN and Neural-$\textbf{ODE}$s. Effective traversal of the latent space learned by Generative Adversarial Networks (GANs) has been the basis for recent breakthroughs in image editing. However, the applicability of such advancements to the video domain has been hindered by the difficulty of representing and controlling videos in the latent space of GANs. In particular, videos are composed of content (i.e., appearance) and complex motion components that require a special mechanism to disentangle and control. To achieve this, VidStyleODE encodes the video content in a pre-trained StyleGAN $\mathcal{W}_+$ space and benefits from a latent ODE component to summarize the spatiotemporal dynamics of the input video. Our novel continuous video generation process then combines the two to generate high-quality and temporally consistent videos with varying frame rates. We show that our proposed method enables a variety of applications on real videos: text-guided appearance manipulation, motion manipulation, image animation, and video interpolation and extrapolation. Project website: https://cyberiada.github.io/VidStyleODE
△ Less
Submitted 12 April, 2023;
originally announced April 2023.
-
Inst-Inpaint: Instructing to Remove Objects with Diffusion Models
Authors:
Ahmet Burak Yildirim,
Vedat Baday,
Erkut Erdem,
Aykut Erdem,
Aysegul Dundar
Abstract:
Image inpainting task refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels that are wished to be erased are defined with binary masks. From the application point of view, a user needs to generate the masks for the objects they would like to remove which can be time-consuming and prone to errors. In this work, we ar…
▽ More
Image inpainting task refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels that are wished to be erased are defined with binary masks. From the application point of view, a user needs to generate the masks for the objects they would like to remove which can be time-consuming and prone to errors. In this work, we are interested in an image inpainting algorithm that estimates which object to be removed based on natural language input and removes it, simultaneously. For this purpose, first, we construct a dataset named GQA-Inpaint for this task. Second, we present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on the instructions given as text prompts. We set various GAN and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare methods with different evaluation metrics that measure the quality and accuracy of the models and show significant quantitative and qualitative improvements.
△ Less
Submitted 9 August, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
ST360IQ: No-Reference Omnidirectional Image Quality Assessment with Spherical Vision Transformers
Authors:
Nafiseh Jabbari Tofighi,
Mohamed Hedi Elfkir,
Nevrez Imamoglu,
Cagri Ozcinar,
Erkut Erdem,
Aykut Erdem
Abstract:
Omnidirectional images, aka 360 images, can deliver immersive and interactive visual experiences. As their popularity has increased dramatically in recent years, evaluating the quality of 360 images has become a problem of interest since it provides insights for capturing, transmitting, and consuming this new media. However, directly adapting quality assessment methods proposed for standard natura…
▽ More
Omnidirectional images, aka 360 images, can deliver immersive and interactive visual experiences. As their popularity has increased dramatically in recent years, evaluating the quality of 360 images has become a problem of interest since it provides insights for capturing, transmitting, and consuming this new media. However, directly adapting quality assessment methods proposed for standard natural images for omnidirectional data poses certain challenges. These models need to deal with very high-resolution data and implicit distortions due to the spherical form of the images. In this study, we present a method for no-reference 360 image quality assessment. Our proposed ST360IQ model extracts tangent viewports from the salient parts of the input omnidirectional image and employs a vision-transformers based module processing saliency selective patches/tokens that estimates a quality score from each viewport. Then, it aggregates these scores to give a final quality score. Our experiments on two benchmark datasets, namely OIQA and CVIQ datasets, demonstrate that as compared to the state-of-the-art, our approach predicts the quality of an omnidirectional image correlated with the human-perceived image quality. The code has been available on https://github.com/Nafiseh-Tofighi/ST360IQ
△ Less
Submitted 13 March, 2023;
originally announced March 2023.
-
Detecting Euphemisms with Literal Descriptions and Visual Imagery
Authors:
İlker Kesen,
Aykut Erdem,
Erkut Erdem,
Iacer Calixto
Abstract:
This paper describes our two-stage system for the Euphemism Detection shared task hosted by the 3rd Workshop on Figurative Language Processing in conjunction with EMNLP 2022. Euphemisms tone down expressions about sensitive or unpleasant issues like addiction and death. The ambiguous nature of euphemistic words or expressions makes it challenging to detect their actual meaning within a context. In…
▽ More
This paper describes our two-stage system for the Euphemism Detection shared task hosted by the 3rd Workshop on Figurative Language Processing in conjunction with EMNLP 2022. Euphemisms tone down expressions about sensitive or unpleasant issues like addiction and death. The ambiguous nature of euphemistic words or expressions makes it challenging to detect their actual meaning within a context. In the first stage, we seek to mitigate this ambiguity by incorporating literal descriptions into input text prompts to our baseline model. It turns out that this kind of direct supervision yields remarkable performance improvement. In the second stage, we integrate visual supervision into our system using visual imageries, two sets of images generated by a text-to-image model by taking terms and descriptions as input. Our experiments demonstrate that visual supervision also gives a statistically significant performance boost. Our system achieved the second place with an F1 score of 87.2%, only about 0.9% worse than the best submission.
△ Less
Submitted 8 November, 2022;
originally announced November 2022.
-
Disentangling Content and Motion for Text-Based Neural Video Manipulation
Authors:
Levent Karacan,
Tolga Kerimoğlu,
İsmail İnan,
Tolga Birdal,
Erkut Erdem,
Aykut Erdem
Abstract:
Giving machines the ability to imagine possible new objects or scenes from linguistic descriptions and produce their realistic renderings is arguably one of the most challenging problems in computer vision. Recent advances in deep generative models have led to new approaches that give promising results towards this goal. In this paper, we introduce a new method called DiCoMoGAN for manipulating vi…
▽ More
Giving machines the ability to imagine possible new objects or scenes from linguistic descriptions and produce their realistic renderings is arguably one of the most challenging problems in computer vision. Recent advances in deep generative models have led to new approaches that give promising results towards this goal. In this paper, we introduce a new method called DiCoMoGAN for manipulating videos with natural language, aiming to perform local and semantic edits on a video clip to alter the appearances of an object of interest. Our GAN architecture allows for better utilization of multiple observations by disentangling content and motion to enable controllable semantic edits. To this end, we introduce two tightly coupled networks: (i) a representation network for constructing a concise understanding of motion dynamics and temporally invariant content, and (ii) a translation network that exploits the extracted latent content representation to actuate the manipulation according to the target description. Our qualitative and quantitative evaluations demonstrate that DiCoMoGAN significantly outperforms existing frame-based methods, producing temporally coherent and semantically more meaningful results.
△ Less
Submitted 5 November, 2022;
originally announced November 2022.
-
Perception-Distortion Trade-off in the SR Space Spanned by Flow Models
Authors:
Cansu Korkmaz,
A. Murat Tekalp,
Zafer Dogan,
Erkut Erdem,
Aykut Erdem
Abstract:
Flow-based generative super-resolution (SR) models learn to produce a diverse set of feasible SR solutions, called the SR space. Diversity of SR solutions increases with the temperature ($τ$) of latent variables, which introduces random variations of texture among sample solutions, resulting in visual artifacts and low fidelity. In this paper, we present a simple but effective image ensembling/fus…
▽ More
Flow-based generative super-resolution (SR) models learn to produce a diverse set of feasible SR solutions, called the SR space. Diversity of SR solutions increases with the temperature ($τ$) of latent variables, which introduces random variations of texture among sample solutions, resulting in visual artifacts and low fidelity. In this paper, we present a simple but effective image ensembling/fusion approach to obtain a single SR image eliminating random artifacts and improving fidelity without significantly compromising perceptual quality. We achieve this by benefiting from a diverse set of feasible photo-realistic solutions in the SR space spanned by flow models. We propose different image ensembling and fusion strategies which offer multiple paths to move sample solutions in the SR space to more desired destinations in the perception-distortion plane in a controllable manner depending on the fidelity vs. perceptual quality requirements of the task at hand. Experimental results demonstrate that our image ensembling/fusion strategy achieves more promising perception-distortion trade-off compared to sample SR images produced by flow models and adversarially trained models in terms of both quantitative metrics and visual quality.
△ Less
Submitted 18 September, 2022;
originally announced September 2022.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza
, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…
▽ More
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
△ Less
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
Stable Marriage Problems with Ties and Incomplete Preferences: An Empirical Comparison of ASP, SAT, ILP, CP, and Local Search Methods
Authors:
Selin Eyupoglu,
Muge Fidan,
Yavuz Gulesen,
Ilayda Begum Izci,
Berkan Teber,
Baturay Yilmaz,
Ahmet Alkan,
Esra Erdem
Abstract:
We study a variation of the Stable Marriage problem, where every man and every woman express their preferences as preference lists which may be incomplete and contain ties. This problem is called the Stable Marriage problem with Ties and Incomplete preferences (SMTI). We consider three optimization variants of SMTI, Max Cardinality, Sex-Equal and Egalitarian, and empirically compare the following…
▽ More
We study a variation of the Stable Marriage problem, where every man and every woman express their preferences as preference lists which may be incomplete and contain ties. This problem is called the Stable Marriage problem with Ties and Incomplete preferences (SMTI). We consider three optimization variants of SMTI, Max Cardinality, Sex-Equal and Egalitarian, and empirically compare the following methods to solve them: Answer Set Programming, Constraint Programming, Integer Linear Programming. For Max Cardinality, we compare these methods with Local Search methods as well. We also empirically compare Answer Set Programming with Propositional Satisfiability, for SMTI instances. This paper is under consideration for acceptance in Theory and Practice of Logic Programming (TPLP).
△ Less
Submitted 17 August, 2021; v1 submitted 11 August, 2021;
originally announced August 2021.
-
Knowledge-Based Stable Roommates Problem: A Real-World Application
Authors:
Muge Fidan,
Esra Erdem
Abstract:
The Stable Roommates problem with Ties and Incomplete lists (SRTI) is a matching problem characterized by the preferences of agents over other agents as roommates, where the preferences may have ties or be incomplete. SRTI asks for a matching that is stable and, sometimes, optimizes a domain-independent fairness criterion (e.g., Egalitarian). However, in real-world applications (e.g., assigning st…
▽ More
The Stable Roommates problem with Ties and Incomplete lists (SRTI) is a matching problem characterized by the preferences of agents over other agents as roommates, where the preferences may have ties or be incomplete. SRTI asks for a matching that is stable and, sometimes, optimizes a domain-independent fairness criterion (e.g., Egalitarian). However, in real-world applications (e.g., assigning students as roommates at a dormitory), we usually consider a variety of domain-specific criteria depending on preferences over the habits and desires of the agents. With this motivation, we introduce a knowledge-based method to SRTI considering domain-specific knowledge, and investigate its real-world application for assigning students as roommates at a university dormitory. This paper is under consideration for acceptance in Theory and Practice of Logic Programming (TPLP).
△ Less
Submitted 10 August, 2021;
originally announced August 2021.
-
SLAMP: Stochastic Latent Appearance and Motion Prediction
Authors:
Adil Kaan Akan,
Erkut Erdem,
Aykut Erdem,
Fatma Güney
Abstract:
Motion is an important cue for video prediction and often utilized by separating video content into static and dynamic components. Most of the previous work utilizing motion is deterministic but there are stochastic methods that can model the inherent uncertainty of the future. Existing stochastic models either do not reason about motion explicitly or make limiting assumptions about the static par…
▽ More
Motion is an important cue for video prediction and often utilized by separating video content into static and dynamic components. Most of the previous work utilizing motion is deterministic but there are stochastic methods that can model the inherent uncertainty of the future. Existing stochastic models either do not reason about motion explicitly or make limiting assumptions about the static part. In this paper, we reason about appearance and motion in the video stochastically by predicting the future based on the motion history. Explicit reasoning about motion without history already reaches the performance of current stochastic models. The motion history further improves the results by allowing to predict consistent dynamics several frames into the future. Our model performs comparably to the state-of-the-art models on the generic video prediction datasets, however, significantly outperforms them on two challenging real-world autonomous driving datasets with complex motion and dynamic background.
△ Less
Submitted 5 August, 2021;
originally announced August 2021.
-
A Gated Fusion Network for Dynamic Saliency Prediction
Authors:
Aysun Kocak,
Erkut Erdem,
Aykut Erdem
Abstract:
Predicting saliency in videos is a challenging problem due to complex modeling of interactions between spatial and temporal information, especially when ever-changing, dynamic nature of videos is considered. Recently, researchers have proposed large-scale datasets and models that take advantage of deep learning as a way to understand what's important for video saliency. These approaches, however,…
▽ More
Predicting saliency in videos is a challenging problem due to complex modeling of interactions between spatial and temporal information, especially when ever-changing, dynamic nature of videos is considered. Recently, researchers have proposed large-scale datasets and models that take advantage of deep learning as a way to understand what's important for video saliency. These approaches, however, learn to combine spatial and temporal features in a static manner and do not adapt themselves much to the changes in the video content. In this paper, we introduce Gated Fusion Network for dynamic saliency (GFSalNet), the first deep saliency model capable of making predictions in a dynamic way via gated fusion mechanism. Moreover, our model also exploits spatial and channel-wise attention within a multi-scale architecture that further allows for highly accurate predictions. We evaluate the proposed approach on a number of datasets, and our experimental analysis demonstrates that it outperforms or is highly competitive with the state of the art. Importantly, we show that it has a good generalization ability, and moreover, exploits temporal information more effectively via its adaptive fusion scheme.
△ Less
Submitted 15 February, 2021;
originally announced February 2021.
-
Object and Relation Centric Representations for Push Effect Prediction
Authors:
Ahmet E. Tekden,
Aykut Erdem,
Erkut Erdem,
Tamim Asfour,
Emre Ugur
Abstract:
Pushing is an essential non-prehensile manipulation skill used for tasks ranging from pre-grasp manipulation to scene rearrangement, reasoning about object relations in the scene, and thus pushing actions have been widely studied in robotics. The effective use of pushing actions often requires an understanding of the dynamics of the manipulated objects and adaptation to the discrepancies between p…
▽ More
Pushing is an essential non-prehensile manipulation skill used for tasks ranging from pre-grasp manipulation to scene rearrangement, reasoning about object relations in the scene, and thus pushing actions have been widely studied in robotics. The effective use of pushing actions often requires an understanding of the dynamics of the manipulated objects and adaptation to the discrepancies between prediction and reality. For this reason, effect prediction and parameter estimation with pushing actions have been heavily investigated in the literature. However, current approaches are limited because they either model systems with a fixed number of objects or use image-based representations whose outputs are not very interpretable and quickly accumulate errors. In this paper, we propose a graph neural network based framework for effect prediction and parameter estimation of pushing actions by modeling object relations based on contacts or articulations. Our framework is validated both in real and simulated environments containing different shaped multi-part objects connected via different types of joints and objects with different masses, and it outperforms image-based representations on physics prediction. Our approach enables the robot to predict and adapt the effect of a pushing action as it observes the scene. It can also be used for tool manipulation with never-seen tools. Further, we demonstrate 6D effect prediction in the lever-up action in the context of robot-based hard-disk disassembly.
△ Less
Submitted 22 February, 2023; v1 submitted 3 February, 2021;
originally announced February 2021.
-
Cross-lingual Visual Pre-training for Multimodal Machine Translation
Authors:
Ozan Caglayan,
Menekse Kuyu,
Mustafa Sercan Amac,
Pranava Madhyastha,
Erkut Erdem,
Aykut Erdem,
Lucia Specia
Abstract:
Pre-trained language models have been shown to improve performance in many natural language tasks substantially. Although the early focus of such models was single language pre-training, recent advances have resulted in cross-lingual and visual pre-training methods. In this paper, we combine these two approaches to learn visually-grounded cross-lingual representations. Specifically, we extend the…
▽ More
Pre-trained language models have been shown to improve performance in many natural language tasks substantially. Although the early focus of such models was single language pre-training, recent advances have resulted in cross-lingual and visual pre-training methods. In this paper, we combine these two approaches to learn visually-grounded cross-lingual representations. Specifically, we extend the translation language modelling (Lample and Conneau, 2019) with masked region classification and perform pre-training with three-way parallel vision & language corpora. We show that when fine-tuned for multimodal machine translation, these models obtain state-of-the-art performance. We also provide qualitative insights into the usefulness of the learned grounded representations.
△ Less
Submitted 20 April, 2021; v1 submitted 25 January, 2021;
originally announced January 2021.
-
Post-hoc Uncertainty Calibration for Domain Drift Scenarios
Authors:
Christian Tomani,
Sebastian Gruber,
Muhammed Ebrar Erdem,
Daniel Cremers,
Florian Buettner
Abstract:
We address the problem of uncertainty calibration. While standard deep neural networks typically yield uncalibrated predictions, calibrated confidence scores that are representative of the true likelihood of a prediction can be achieved using post-hoc calibration methods. However, to date the focus of these approaches has been on in-domain calibration. Our contribution is two-fold. First, we show…
▽ More
We address the problem of uncertainty calibration. While standard deep neural networks typically yield uncalibrated predictions, calibrated confidence scores that are representative of the true likelihood of a prediction can be achieved using post-hoc calibration methods. However, to date the focus of these approaches has been on in-domain calibration. Our contribution is two-fold. First, we show that existing post-hoc calibration methods yield highly over-confident predictions under domain shift. Second, we introduce a simple strategy where perturbations are applied to samples in the validation set before performing the post-hoc calibration step. In extensive experiments, we demonstrate that this perturbation step results in substantially better calibration under domain shift on a wide range of architectures and modelling tasks.
△ Less
Submitted 23 June, 2021; v1 submitted 20 December, 2020;
originally announced December 2020.
-
MSVD-Turkish: A Comprehensive Multimodal Dataset for Integrated Vision and Language Research in Turkish
Authors:
Begum Citamak,
Ozan Caglayan,
Menekse Kuyu,
Erkut Erdem,
Aykut Erdem,
Pranava Madhyastha,
Lucia Specia
Abstract:
Automatic generation of video descriptions in natural language, also called video captioning, aims to understand the visual content of the video and produce a natural language sentence depicting the objects and actions in the scene. This challenging integrated vision and language problem, however, has been predominantly addressed for English. The lack of data and the linguistic properties of other…
▽ More
Automatic generation of video descriptions in natural language, also called video captioning, aims to understand the visual content of the video and produce a natural language sentence depicting the objects and actions in the scene. This challenging integrated vision and language problem, however, has been predominantly addressed for English. The lack of data and the linguistic properties of other languages limit the success of existing approaches for such languages. In this paper we target Turkish, a morphologically rich and agglutinative language that has very different properties compared to English. To do so, we create the first large scale video captioning dataset for this language by carefully translating the English descriptions of the videos in the MSVD (Microsoft Research Video Description Corpus) dataset into Turkish. In addition to enabling research in video captioning in Turkish, the parallel English-Turkish descriptions also enables the study of the role of video context in (multimodal) machine translation. In our experiments, we build models for both video captioning and multimodal machine translation and investigate the effect of different word segmentation approaches and different neural architectures to better address the properties of Turkish. We hope that the MSVD-Turkish dataset and the results reported in this work will lead to better video captioning and multimodal machine translation models for Turkish and other morphology rich and agglutinative languages.
△ Less
Submitted 13 December, 2020;
originally announced December 2020.
-
CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions
Authors:
Tayfun Ates,
M. Samil Atesoglu,
Cagatay Yigit,
Ilker Kesen,
Mert Kobas,
Erkut Erdem,
Aykut Erdem,
Tilbe Goksun,
Deniz Yuret
Abstract:
Humans are able to perceive, understand and reason about causal events. Developing models with similar physical and causal understanding capabilities is a long-standing goal of artificial intelligence. As a step towards this direction, we introduce CRAFT, a new video question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 58K video and q…
▽ More
Humans are able to perceive, understand and reason about causal events. Developing models with similar physical and causal understanding capabilities is a long-standing goal of artificial intelligence. As a step towards this direction, we introduce CRAFT, a new video question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 58K video and question pairs that are generated from 10K videos from 20 different virtual environments, containing various objects in motion that interact with each other and the scene. Two question categories in CRAFT include previously studied descriptive and counterfactual questions. Additionally, inspired by the Force Dynamics Theory in cognitive linguistics, we introduce a new causal question category that involves understanding the causal interactions between objects through notions like cause, enable, and prevent. Our results show that even though the questions in CRAFT are easy for humans, the tested baseline models, including existing state-of-the-art methods, do not yet deal with the challenges posed in our benchmark.
△ Less
Submitted 1 March, 2022; v1 submitted 8 December, 2020;
originally announced December 2020.
-
Dynamic Multi-Agent Path Finding based on Conflict Resolution using Answer Set Programming
Authors:
Basem Atiq,
Volkan Patoglu,
Esra Erdem
Abstract:
We study a dynamic version of multi-agent path finding problem (called D-MAPF) where existing agents may leave and new agents may join the team at different times. We introduce a new method to solve D-MAPF based on conflict-resolution. The idea is, when a set of new agents joins the team and there are conflicts, instead of replanning for the whole team, to replan only for a minimal subset of agent…
▽ More
We study a dynamic version of multi-agent path finding problem (called D-MAPF) where existing agents may leave and new agents may join the team at different times. We introduce a new method to solve D-MAPF based on conflict-resolution. The idea is, when a set of new agents joins the team and there are conflicts, instead of replanning for the whole team, to replan only for a minimal subset of agents whose plans conflict with each other. We utilize answer set programming as part of our method for planning, replanning and identifying minimal set of conflicts.
△ Less
Submitted 21 September, 2020;
originally announced September 2020.
-
Solving Gossip Problems using Answer Set Programming: An Epistemic Planning Approach
Authors:
Esra Erdem,
Andreas Herzig
Abstract:
We investigate the use of Answer Set Programming to solve variations of gossip problems, by modeling them as epistemic planning problems.
We investigate the use of Answer Set Programming to solve variations of gossip problems, by modeling them as epistemic planning problems.
△ Less
Submitted 21 September, 2020;
originally announced September 2020.
-
Reasoning about Cardinal Directions between 3-Dimensional Extended Objects using Answer Set Programming
Authors:
Yusuf Izmirlioglu,
Esra Erdem
Abstract:
We propose a novel formal framework (called 3D-nCDC-ASP) to represent and reason about cardinal directions between extended objects in 3-dimensional (3D) space, using Answer Set Programming (ASP). 3D-nCDC-ASP extends Cardinal Directional Calculus (CDC) with a new type of default constraints, and nCDC-ASP to 3D. 3D-nCDC-ASP provides a flexible platform offering different types of reasoning: Nonmono…
▽ More
We propose a novel formal framework (called 3D-nCDC-ASP) to represent and reason about cardinal directions between extended objects in 3-dimensional (3D) space, using Answer Set Programming (ASP). 3D-nCDC-ASP extends Cardinal Directional Calculus (CDC) with a new type of default constraints, and nCDC-ASP to 3D. 3D-nCDC-ASP provides a flexible platform offering different types of reasoning: Nonmonotonic reasoning with defaults, checking consistency of a set of constraints on 3D cardinal directions between objects, explaining inconsistencies, and inferring missing CDC relations. We prove the soundness of 3D-nCDC-ASP, and illustrate its usefulness with applications. This paper is under consideration for acceptance in TPLP.
△ Less
Submitted 10 August, 2020;
originally announced August 2020.
-
Explanation Generation for Multi-Modal Multi-Agent Path Finding with Optimal Resource Utilization using Answer Set Programming
Authors:
Aysu Bogatarkan,
Esra Erdem
Abstract:
The multi-agent path finding (MAPF) problem is a combinatorial search problem that aims at finding paths for multiple agents (e.g., robots) in an environment (e.g., an autonomous warehouse) such that no two agents collide with each other, and subject to some constraints on the lengths of paths. We consider a general version of MAPF, called mMAPF, that involves multi-modal transportation modes (e.g…
▽ More
The multi-agent path finding (MAPF) problem is a combinatorial search problem that aims at finding paths for multiple agents (e.g., robots) in an environment (e.g., an autonomous warehouse) such that no two agents collide with each other, and subject to some constraints on the lengths of paths. We consider a general version of MAPF, called mMAPF, that involves multi-modal transportation modes (e.g., due to velocity constraints) and consumption of different types of resources (e.g., batteries). The real-world applications of mMAPF require flexibility (e.g., solving variations of mMAPF) as well as explainability. Our earlier studies on mMAPF have focused on the former challenge of flexibility. In this study, we focus on the latter challenge of explainability, and introduce a method for generating explanations for queries regarding the feasibility and optimality of solutions, the nonexistence of solutions, and the observations about solutions. Our method is based on answer set programming. This paper is under consideration for acceptance in TPLP.
△ Less
Submitted 8 August, 2020;
originally announced August 2020.
-
Human Robot Collaborative Assembly Planning: An Answer Set Programming Approach
Authors:
Momina Rizwan,
Volkan Patoglu,
Esra Erdem
Abstract:
For planning an assembly of a product from a given set of parts, robots necessitate certain cognitive skills: high-level planning is needed to decide the order of actuation actions, while geometric reasoning is needed to check the feasibility of these actions. For collaborative assembly tasks with humans, robots require further cognitive capabilities, such as commonsense reasoning, sensing, and co…
▽ More
For planning an assembly of a product from a given set of parts, robots necessitate certain cognitive skills: high-level planning is needed to decide the order of actuation actions, while geometric reasoning is needed to check the feasibility of these actions. For collaborative assembly tasks with humans, robots require further cognitive capabilities, such as commonsense reasoning, sensing, and communication skills, not only to cope with the uncertainty caused by incomplete knowledge about the humans' behaviors but also to ensure safer collaborations. We propose a novel method for collaborative assembly planning under uncertainty, that utilizes hybrid conditional planning extended with commonsense reasoning and a rich set of communication actions for collaborative tasks. Our method is based on answer set programming. We show the applicability of our approach in a real-world assembly domain, where a bi-manual Baxter robot collaborates with a human teammate to assemble furniture. This manuscript is under consideration for acceptance in TPLP.
△ Less
Submitted 8 August, 2020;
originally announced August 2020.
-
A General Framework for Stable Roommates Problems using Answer Set Programming
Authors:
Esra Erdem,
Muge Fidan,
David Manlove,
Patrick Prosser
Abstract:
The Stable Roommates problem (SR) is characterized by the preferences of agents over other agents as roommates: each agent ranks all others in strict order of preference. A solution to SR is then a partition of the agents into pairs so that each pair shares a room, and there is no pair of agents that would block this matching (i.e., who prefers the other to their roommate in the matching). There a…
▽ More
The Stable Roommates problem (SR) is characterized by the preferences of agents over other agents as roommates: each agent ranks all others in strict order of preference. A solution to SR is then a partition of the agents into pairs so that each pair shares a room, and there is no pair of agents that would block this matching (i.e., who prefers the other to their roommate in the matching). There are interesting variations of SR that are motivated by applications (e.g., the preference lists may be incomplete (SRI) and involve ties (SRTI)), and that try to find a more fair solution (e.g., Egalitarian SR). Unlike the Stable Marriage problem, every SR instance is not guaranteed to have a solution. For that reason, there are also variations of SR that try to find a good-enough solution (e.g., Almost SR). Most of these variations are NP-hard. We introduce a formal framework, called SRTI-ASP, utilizing the logic programming paradigm Answer Set Programming, that is provable and general enough to solve many of such variations of SR. Our empirical analysis shows that SRTI-ASP is also promising for applications. This paper is under consideration for acceptance in TPLP.
△ Less
Submitted 7 August, 2020;
originally announced August 2020.
-
Burst Photography for Learning to Enhance Extremely Dark Images
Authors:
Ahmet Serdar Karadeniz,
Erkut Erdem,
Aykut Erdem
Abstract:
Capturing images under extremely low-light conditions poses significant challenges for the standard camera pipeline. Images become too dark and too noisy, which makes traditional enhancement techniques almost impossible to apply. Recently, learning-based approaches have shown very promising results for this task since they have substantially more expressive capabilities to allow for improved quali…
▽ More
Capturing images under extremely low-light conditions poses significant challenges for the standard camera pipeline. Images become too dark and too noisy, which makes traditional enhancement techniques almost impossible to apply. Recently, learning-based approaches have shown very promising results for this task since they have substantially more expressive capabilities to allow for improved quality. Motivated by these studies, in this paper, we aim to leverage burst photography to boost the performance and obtain much sharper and more accurate RGB images from extremely dark raw images. The backbone of our proposed framework is a novel coarse-to-fine network architecture that generates high-quality outputs progressively. The coarse network predicts a low-resolution, denoised raw image, which is then fed to the fine network to recover fine-scale details and realistic textures. To further reduce the noise level and improve the color accuracy, we extend this network to a permutation invariant structure so that it takes a burst of low-light images as input and merges information from multiple images at the feature-level. Our experiments demonstrate that our approach leads to perceptually more pleasing results than the state-of-the-art methods by producing more detailed and considerably higher quality images.
△ Less
Submitted 19 November, 2021; v1 submitted 17 June, 2020;
originally announced June 2020.
-
Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters
Authors:
İlker Kesen,
Ozan Arkan Can,
Erkut Erdem,
Aykut Erdem,
Deniz Yuret
Abstract:
How to best integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct visual attention over high-level visual features, may not be optimal. We hypothesize that the use of language to also condition the bottom-up processing from p…
▽ More
How to best integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct visual attention over high-level visual features, may not be optimal. We hypothesize that the use of language to also condition the bottom-up processing from pixels to high-level features can provide benefits to the overall performance. To support our claim, we propose a U-Net-based model and perform experiments on two language-vision dense-prediction tasks: referring expression segmentation and language-guided image colorization. We compare results where either one or both of the top-down and bottom-up visual branches are conditioned on language. Our experiments reveal that using language to control the filters for bottom-up visual processing in addition to top-down attention leads to better results on both tasks and achieves competitive performance. Our linguistic analysis suggests that bottom-up conditioning improves segmentation of objects especially when input text refers to low-level visual concepts. Code is available at https://github.com/ilkerkesen/bvpr.
△ Less
Submitted 23 June, 2022; v1 submitted 28 March, 2020;
originally announced March 2020.
-
Burst Denoising of Dark Images
Authors:
Ahmet Serdar Karadeniz,
Erkut Erdem,
Aykut Erdem
Abstract:
Capturing images under extremely low-light conditions poses significant challenges for the standard camera pipeline. Images become too dark and too noisy, which makes traditional image enhancement techniques almost impossible to apply. Very recently, researchers have shown promising results using learning based approaches. Motivated by these ideas, in this paper, we propose a deep learning framewo…
▽ More
Capturing images under extremely low-light conditions poses significant challenges for the standard camera pipeline. Images become too dark and too noisy, which makes traditional image enhancement techniques almost impossible to apply. Very recently, researchers have shown promising results using learning based approaches. Motivated by these ideas, in this paper, we propose a deep learning framework for obtaining clean and colorful RGB images from extremely dark raw images. The backbone of our framework is a novel coarse-to-fine network architecture that generates high-quality outputs in a progressive manner. The coarse network predicts a low-resolution, denoised raw image, which is then fed to the fine network to recover fine-scale details and realistic textures. To further reduce noise and improve color accuracy, we extend this network to a permutation invariant structure so that it takes a burst of low-light images as input and merges information from multiple images at the feature-level. Our experiments demonstrate that the proposed approach leads to perceptually more pleasing results than state-of-the-art methods by producing much sharper and higher quality images.
△ Less
Submitted 18 June, 2020; v1 submitted 17 March, 2020;
originally announced March 2020.
-
mustGAN: Multi-Stream Generative Adversarial Networks for MR Image Synthesis
Authors:
Mahmut Yurt,
Salman Ul Hassan Dar,
Aykut Erdem,
Erkut Erdem,
Tolga Çukur
Abstract:
Multi-contrast MRI protocols increase the level of morphological information available for diagnosis. Yet, the number and quality of contrasts is limited in practice by various factors including scan time and patient motion. Synthesis of missing or corrupted contrasts can alleviate this limitation to improve clinical utility. Common approaches for multi-contrast MRI involve either one-to-one and m…
▽ More
Multi-contrast MRI protocols increase the level of morphological information available for diagnosis. Yet, the number and quality of contrasts is limited in practice by various factors including scan time and patient motion. Synthesis of missing or corrupted contrasts can alleviate this limitation to improve clinical utility. Common approaches for multi-contrast MRI involve either one-to-one and many-to-one synthesis methods. One-to-one methods take as input a single source contrast, and they learn a latent representation sensitive to unique features of the source. Meanwhile, many-to-one methods receive multiple distinct sources, and they learn a shared latent representation more sensitive to common features across sources. For enhanced image synthesis, here we propose a multi-stream approach that aggregates information across multiple source images via a mixture of multiple one-to-one streams and a joint many-to-one stream. The shared feature maps generated in the many-to-one stream and the complementary feature maps generated in the one-to-one streams are combined with a fusion block. The location of the fusion block is adaptively modified to maximize task-specific performance. Qualitative and quantitative assessments on T1-, T2-, PD-weighted and FLAIR images clearly demonstrate the superior performance of the proposed method compared to previous state-of-the-art one-to-one and many-to-one methods.
△ Less
Submitted 25 September, 2019;
originally announced September 2019.
-
Procedural Reasoning Networks for Understanding Multimodal Procedures
Authors:
Mustafa Sercan Amac,
Semih Yagcioglu,
Aykut Erdem,
Erkut Erdem
Abstract:
This paper addresses the problem of comprehending procedural commonsense knowledge. This is a challenging task as it requires identifying key entities, keeping track of their state changes, and understanding temporal and causal relations. Contrary to most of the previous work, in this study, we do not rely on strong inductive bias and explore the question of how multimodality can be exploited to p…
▽ More
This paper addresses the problem of comprehending procedural commonsense knowledge. This is a challenging task as it requires identifying key entities, keeping track of their state changes, and understanding temporal and causal relations. Contrary to most of the previous work, in this study, we do not rely on strong inductive bias and explore the question of how multimodality can be exploited to provide a complementary semantic signal. Towards this end, we introduce a new entity-aware neural comprehension model augmented with external relational memory units. Our model learns to dynamically update entity states in relation to each other while reading the text instructions. Our experimental analysis on the visual reasoning tasks in the recently proposed RecipeQA dataset reveals that our approach improves the accuracy of the previously reported models by a large margin. Moreover, we find that our model learns effective dynamic representations of entities even though we do not use any supervision at the level of entity states.
△ Less
Submitted 19 September, 2019;
originally announced September 2019.
-
Proceedings 35th International Conference on Logic Programming (Technical Communications)
Authors:
Bart Bogaerts,
Esra Erdem,
Paul Fodor,
Andrea Formisano,
Giovambattista Ianni,
Daniela Inclezan,
German Vidal,
Alicia Villanueva,
Marina De Vos,
Fangkai Yang
Abstract:
Since the first conference held in Marseille in 1982, ICLP has been the premier international event for presenting research in logic programming. Contributions are sought in all areas of logic programming, including but not restricted to:
Foundations: Semantics, Formalisms, Nonmonotonic reasoning, Knowledge representation.
Languages: Concurrency, Objects, Coordination, Mobility, Higher Order,…
▽ More
Since the first conference held in Marseille in 1982, ICLP has been the premier international event for presenting research in logic programming. Contributions are sought in all areas of logic programming, including but not restricted to:
Foundations: Semantics, Formalisms, Nonmonotonic reasoning, Knowledge representation.
Languages: Concurrency, Objects, Coordination, Mobility, Higher Order, Types, Modes, Assertions, Modules, Meta-programming, Logic-based domain-specific languages, Programming Techniques.
Declarative programming: Declarative program development, Analysis, Type and mode inference, Partial evaluation, Abstract interpretation, Transformation, Validation, Verification, Debugging, Profiling, Testing, Execution visualization
Implementation: Virtual machines, Compilation, Memory management, Parallel/distributed execution, Constraint handling rules, Tabling, Foreign interfaces, User interfaces.
Related Paradigms and Synergies: Inductive and Co-inductive Logic Programming, Constraint Logic Programming, Answer Set Programming, Interaction with SAT, SMT and CSP solvers, Logic programming techniques for type inference and theorem proving, Argumentation, Probabilistic Logic Programming, Relations to object-oriented and Functional programming.
Applications: Databases, Big Data, Data integration and federation, Software engineering, Natural language processing, Web and Semantic Web, Agents, Artificial intelligence, Computational life sciences, Education, Cybersecurity, and Robotics.
△ Less
Submitted 17 September, 2019;
originally announced September 2019.
-
Belief Regulated Dual Propagation Nets for Learning Action Effects on Groups of Articulated Objects
Authors:
Ahmet E. Tekden,
Aykut Erdem,
Erkut Erdem,
Mert Imre,
M. Yunus Seker,
Emre Ugur
Abstract:
In recent years, graph neural networks have been successfully applied for learning the dynamics of complex and partially observable physical systems. However, their use in the robotics domain is, to date, still limited. In this paper, we introduce Belief Regulated Dual Propagation Networks (BRDPN), a general-purpose learnable physics engine, which enables a robot to predict the effects of its acti…
▽ More
In recent years, graph neural networks have been successfully applied for learning the dynamics of complex and partially observable physical systems. However, their use in the robotics domain is, to date, still limited. In this paper, we introduce Belief Regulated Dual Propagation Networks (BRDPN), a general-purpose learnable physics engine, which enables a robot to predict the effects of its actions in scenes containing groups of articulated multi-part objects. Specifically, our framework extends recently proposed propagation networks (PropNets) and consists of two complementary components, a physics predictor and a belief regulator. While the former predicts the future states of the object(s) manipulated by the robot, the latter constantly corrects the robot's knowledge regarding the objects and their relations. Our results showed that after training in a simulator, the robot can reliably predict the consequences of its actions in object trajectory level and exploit its own interaction experience to correct its belief about the state of the environment, enabling better predictions in partially observable environments. Furthermore, the trained model was transferred to the real world and verified in predicting trajectories of pushed interacting objects whose joint relations were initially unknown. We compared BRDPN against PropNets, and showed that BRDPN performs consistently well. Moreover, BRDPN can adapt its physic predictions, since the relations can be predicted online.
△ Less
Submitted 16 March, 2020; v1 submitted 9 September, 2019;
originally announced September 2019.
-
Introduction to the 35th International Conference on Logic Programming Special Issue
Authors:
Esra Erdem,
Andrea Formisano,
German Vidal,
Fangkai Yang
Abstract:
We are proud to introduce this special issue of Theory and Practice of Logic Programming (TPLP), dedicated to the regular papers accepted for the 35th International Conference on Logic Programming (ICLP). The ICLP meetings started in Marseille in 1982 and since then constitute the main venue for presenting and discussing work in the area of logic programming. Under consideration for acceptance in…
▽ More
We are proud to introduce this special issue of Theory and Practice of Logic Programming (TPLP), dedicated to the regular papers accepted for the 35th International Conference on Logic Programming (ICLP). The ICLP meetings started in Marseille in 1982 and since then constitute the main venue for presenting and discussing work in the area of logic programming. Under consideration for acceptance in TPLP.
△ Less
Submitted 10 August, 2019;
originally announced August 2019.
-
Object Placement on Cluttered Surfaces: A Nested Local Search Approach
Authors:
Abdul Rahman Dabbour,
Esra Erdem,
Volkan Patoglu
Abstract:
For planning rearrangements of objects in a clutter, it is required to know the goal configuration of the objects. However, in real life scenarios, this information is not available most of the time. We introduce a novel method that computes a collision-free placement of objects on a cluttered surface, while minimizing the total number and amount of displacements of the existing moveable objects.…
▽ More
For planning rearrangements of objects in a clutter, it is required to know the goal configuration of the objects. However, in real life scenarios, this information is not available most of the time. We introduce a novel method that computes a collision-free placement of objects on a cluttered surface, while minimizing the total number and amount of displacements of the existing moveable objects. Our method applies nested local searches that perform multi-objective optimizations guided by heuristics. Experimental evaluations demonstrate high computational efficiency and success rate of our method, as well as good quality of solutions.
△ Less
Submitted 20 June, 2019;
originally announced June 2019.
-
A Formal Framework for Robot Construction Problems: A Hybrid Planning Approach
Authors:
Faseeh Ahmad,
Esra Erdem,
Volkan Patoglu
Abstract:
We study robot construction problems where multiple autonomous robots rearrange stacks of prefabricated blocks to build stable structures. These problems are challenging due to ramifications of actions, true concurrency, and requirements of supportedness of blocks by other blocks and stability of the structure at all times. We propose a formal hybrid planning framework to solve a wide range of rob…
▽ More
We study robot construction problems where multiple autonomous robots rearrange stacks of prefabricated blocks to build stable structures. These problems are challenging due to ramifications of actions, true concurrency, and requirements of supportedness of blocks by other blocks and stability of the structure at all times. We propose a formal hybrid planning framework to solve a wide range of robot construction problems, based on Answer Set Programming. This framework not only decides for a stable final configuration of the structure, but also computes the order of manipulation tasks for multiple autonomous robots to build the structure from an initial configuration, while simultaneously ensuring the stability, supportedness and other desired properties of the partial construction at each step of the plan. We prove the soundness and completeness of our formal method with respect to these properties. We introduce a set of challenging robot construction benchmark instances, including bridge building and stack overhanging scenarios, discuss the usefulness of our framework over these instances, and demonstrate the applicability of our method using a bimanual Baxter robot.
△ Less
Submitted 17 March, 2019; v1 submitted 2 March, 2019;
originally announced March 2019.
-
RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes
Authors:
Semih Yagcioglu,
Aykut Erdem,
Erkut Erdem,
Nazli Ikizler-Cinbis
Abstract:
Understanding and reasoning about cooking recipes is a fruitful research direction towards enabling machines to interpret procedural text. In this work, we introduce RecipeQA, a dataset for multimodal comprehension of cooking recipes. It comprises of approximately 20K instructional recipes with multiple modalities such as titles, descriptions and aligned set of images. With over 36K automatically…
▽ More
Understanding and reasoning about cooking recipes is a fruitful research direction towards enabling machines to interpret procedural text. In this work, we introduce RecipeQA, a dataset for multimodal comprehension of cooking recipes. It comprises of approximately 20K instructional recipes with multiple modalities such as titles, descriptions and aligned set of images. With over 36K automatically generated question-answer pairs, we design a set of comprehension and reasoning tasks that require joint understanding of images and text, capturing the temporal flow of events and making sense of procedural knowledge. Our preliminary results indicate that RecipeQA will serve as a challenging test bed and an ideal benchmark for evaluating machine comprehension systems. The data and leaderboard are available at http://hucvl.github.io/recipeqa.
△ Less
Submitted 4 September, 2018;
originally announced September 2018.
-
Manipulating Attributes of Natural Scenes via Hallucination
Authors:
Levent Karacan,
Zeynep Akata,
Aykut Erdem,
Erkut Erdem
Abstract:
In this study, we explore building a two-stage framework for enabling users to directly manipulate high-level attributes of a natural scene. The key to our approach is a deep generative network which can hallucinate images of a scene as if they were taken at a different season (e.g. during winter), weather condition (e.g. in a cloudy day) or time of the day (e.g. at sunset). Once the scene is hall…
▽ More
In this study, we explore building a two-stage framework for enabling users to directly manipulate high-level attributes of a natural scene. The key to our approach is a deep generative network which can hallucinate images of a scene as if they were taken at a different season (e.g. during winter), weather condition (e.g. in a cloudy day) or time of the day (e.g. at sunset). Once the scene is hallucinated with the given attributes, the corresponding look is then transferred to the input image while preserving the semantic details intact, giving a photo-realistic manipulation result. As the proposed framework hallucinates what the scene will look like, it does not require any reference style image as commonly utilized in most of the appearance or style transfer approaches. Moreover, it allows to simultaneously manipulate a given scene according to a diverse set of transient attributes within a single model, eliminating the need of training multiple networks per each translation task. Our comprehensive set of qualitative and quantitative results demonstrate the effectiveness of our approach against the competing methods.
△ Less
Submitted 9 October, 2019; v1 submitted 22 August, 2018;
originally announced August 2018.
-
Language Guided Fashion Image Manipulation with Feature-wise Transformations
Authors:
Mehmet Günel,
Erkut Erdem,
Aykut Erdem
Abstract:
Developing techniques for editing an outfit image through natural sentences and accordingly generating new outfits has promising applications for art, fashion and design. However, it is considered as a certainly challenging task since image manipulation should be carried out only on the relevant parts of the image while keeping the remaining sections untouched. Moreover, this manipulation process…
▽ More
Developing techniques for editing an outfit image through natural sentences and accordingly generating new outfits has promising applications for art, fashion and design. However, it is considered as a certainly challenging task since image manipulation should be carried out only on the relevant parts of the image while keeping the remaining sections untouched. Moreover, this manipulation process should generate an image that is as realistic as possible. In this work, we propose FiLMedGAN, which leverages feature-wise linear modulation (FiLM) to relate and transform visual features with natural language representations without using extra spatial information. Our experiments demonstrate that this approach, when combined with skip connections and total variation regularization, produces more plausible results than the baseline work, and has a better localization capability when generating new outfits consistent with the target description.
△ Less
Submitted 12 August, 2018;
originally announced August 2018.
-
Image Synthesis in Multi-Contrast MRI with Conditional Generative Adversarial Networks
Authors:
Salman Ul Hassan Dar,
Mahmut Yurt,
Levent Karacan,
Aykut Erdem,
Erkut Erdem,
Tolga Çukur
Abstract:
Acquiring images of the same anatomy with multiple different contrasts increases the diversity of diagnostic information available in an MR exam. Yet, scan time limitations may prohibit acquisition of certain contrasts, and images for some contrast may be corrupted by noise and artifacts. In such cases, the ability to synthesize unacquired or corrupted contrasts from remaining contrasts can improv…
▽ More
Acquiring images of the same anatomy with multiple different contrasts increases the diversity of diagnostic information available in an MR exam. Yet, scan time limitations may prohibit acquisition of certain contrasts, and images for some contrast may be corrupted by noise and artifacts. In such cases, the ability to synthesize unacquired or corrupted contrasts from remaining contrasts can improve diagnostic utility. For multi-contrast synthesis, current methods learn a nonlinear intensity transformation between the source and target images, either via nonlinear regression or deterministic neural networks. These methods can in turn suffer from loss of high-spatial-frequency information in synthesized images. Here we propose a new approach for multi-contrast MRI synthesis based on conditional generative adversarial networks. The proposed approach preserves high-frequency details via an adversarial loss; and it offers enhanced synthesis performance via a pixel-wise loss for registered multi-contrast images and a cycle-consistency loss for unregistered images. Information from neighboring cross-sections are utilized to further improved synthesis quality. Demonstrations on T1- and T2-weighted images from healthy subjects and patients clearly indicate the superior performance of the proposed approach compared to previous state-of-the-art methods. Our synthesis approach can help improve quality and versatility of multi-contrast MRI exams without the need for prolonged examinations.
△ Less
Submitted 4 February, 2018;
originally announced February 2018.
-
Hybrid Conditional Planning using Answer Set Programming
Authors:
Ibrahim Faruk Yalciner,
Ahmed Nouman,
Volkan Patoglu,
Esra Erdem
Abstract:
We introduce a parallel offline algorithm for computing hybrid conditional plans, called HCP-ASP, oriented towards robotics applications. HCP-ASP relies on modeling actuation actions and sensing actions in an expressive nonmonotonic language of answer set programming (ASP), and computation of the branches of a conditional plan in parallel using an ASP solver. In particular, thanks to external atom…
▽ More
We introduce a parallel offline algorithm for computing hybrid conditional plans, called HCP-ASP, oriented towards robotics applications. HCP-ASP relies on modeling actuation actions and sensing actions in an expressive nonmonotonic language of answer set programming (ASP), and computation of the branches of a conditional plan in parallel using an ASP solver. In particular, thanks to external atoms, continuous feasibility checks (like collision checks) are embedded into formal representations of actuation actions and sensing actions in ASP; and thus each branch of a hybrid conditional plan describes a feasible execution of actions to reach their goals. Utilizing nonmonotonic constructs and nondeterministic choices, partial knowledge about states and nondeterministic effects of sensing actions can be explicitly formalized in ASP; and thus each branch of a conditional plan can be computed by an ASP solver without necessitating a conformant planner and an ordering of sensing actions in advance. We apply our method in a service robotics domain and report experimental evaluations. Furthermore, we present performance comparisons with other compilation based conditional planners on standardized benchmark domains. This paper is under consideration for acceptance in TPLP.
△ Less
Submitted 18 July, 2017;
originally announced July 2017.
-
Re-evaluating Automatic Metrics for Image Captioning
Authors:
Mert Kilickaya,
Aykut Erdem,
Nazli Ikizler-Cinbis,
Erkut Erdem
Abstract:
The task of generating natural language descriptions from images has received a lot of attention in recent years. Consequently, it is becoming increasingly important to evaluate such image captioning approaches in an automatic manner. In this paper, we provide an in-depth evaluation of the existing image captioning metrics through a series of carefully designed experiments. Moreover, we explore th…
▽ More
The task of generating natural language descriptions from images has received a lot of attention in recent years. Consequently, it is becoming increasingly important to evaluate such image captioning approaches in an automatic manner. In this paper, we provide an in-depth evaluation of the existing image captioning metrics through a series of carefully designed experiments. Moreover, we explore the utilization of the recently proposed Word Mover's Distance (WMD) document metric for the purpose of image captioning. Our findings outline the differences and/or similarities between metrics and their relative robustness by means of extensive correlation, accuracy and distraction based evaluations. Our results also demonstrate that WMD provides strong advantages over other metrics.
△ Less
Submitted 22 December, 2016;
originally announced December 2016.