
Showing 1–41 of 41 results for author: Erdem, A

  1. arXiv:2407.12498  [pdf, other]

    cs.CL cs.CV

    Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning

    Authors: Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem

    Abstract: The linguistic capabilities of Multimodal Large Language Models (MLLMs) are critical for their effective application across diverse tasks. This study aims to evaluate the performance of MLLMs on the VALSE benchmark, focusing on the efficacy of few-shot In-Context Learning (ICL), and Chain-of-Thought (CoT) prompting. We conducted a comprehensive assessment of state-of-the-art MLLMs, varying in mode…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: Preprint. 33 pages, 17 Figures, 3 Tables

  2. arXiv:2406.09368  [pdf, other]

    cs.CV

    CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

    Authors: Yigit Ekin, Ahmet Burak Yildirim, Erdem Eren Caglar, Aykut Erdem, Erkut Erdem, Aysegul Dundar

    Abstract: Advanced image editing techniques, particularly inpainting, are essential for seamlessly removing unwanted elements while preserving visual integrity. Traditional GAN-based methods have achieved notable success, but recent advancements in diffusion models have produced superior results due to their training on large-scale datasets, enabling the generation of remarkably realistic inpainted images.…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Project page: https://yigitekin.github.io/CLIPAway/

  3. arXiv:2405.00878  [pdf, other]

    cs.CV

    SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models

    Authors: Burak Can Biner, Farrin Marouf Sofian, Umur Berkay Karakaş, Duygu Ceylan, Erkut Erdem, Aykut Erdem

    Abstract: We are witnessing a revolution in conditional image synthesis with the recent success of large scale text-to-image generation methods. This success also opens up new opportunities in controlling the generation and editing process using multi-modal input. While spatial control using cues such as depth, sketch, and other images has attracted a lot of research, we argue that another equally effective…

    Submitted 1 May, 2024; originally announced May 2024.

  4. arXiv:2404.16621  [pdf, other]

    cs.LG cs.AI cs.CL

    Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare

    Authors: Emre Can Acikgoz, Osman Batur İnce, Rayene Bench, Arda Anıl Boz, İlker Kesen, Aykut Erdem, Erkut Erdem

    Abstract: The integration of Large Language Models (LLMs) into healthcare promises to transform medical diagnostics, research, and patient care. Yet, the progression of medical LLMs faces obstacles such as complex training requirements, rigorous evaluation demands, and the dominance of proprietary models that restrict academic exploration. Transparent, comprehensive access to LLM resources is essential for…

    Submitted 25 April, 2024; originally announced April 2024.

  5. arXiv:2404.12013  [pdf, other]

    cs.CL

    Sequential Compositional Generalization in Multimodal Models

    Authors: Semih Yagcioglu, Osman Batur İnce, Aykut Erdem, Erkut Erdem, Desmond Elliott, Deniz Yuret

    Abstract: The rise of large-scale multimodal models has paved the pathway for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications in a variety of complex tasks. However, a pressing question that remains is their genuine capability for stronger forms of generalization, which has been largely underexplored in the multimodal setting. Our study aims to address thi…

    Submitted 18 April, 2024; originally announced April 2024.

    Comments: Accepted to the main conference of NAACL (2024) as a long paper

  6. arXiv:2311.07022  [pdf, other]

    cs.CL cs.AI cs.CV

    ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models

    Authors: Ilker Kesen, Andrea Pedrotti, Mustafa Dogan, Michele Cafagna, Emre Can Acikgoz, Letitia Parcalabescu, Iacer Calixto, Anette Frank, Albert Gatt, Aykut Erdem, Erkut Erdem

    Abstract: With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm foo…

    Submitted 12 November, 2023; originally announced November 2023.

    Comments: Preprint. 48 pages, 22 figures, 10 tables

  7. arXiv:2310.12118  [pdf, other]

    cs.CL

    Harnessing Dataset Cartography for Improved Compositional Generalization in Transformers

    Authors: Osman Batur İnce, Tanin Zeraati, Semih Yagcioglu, Yadollah Yaghoobzadeh, Erkut Erdem, Aykut Erdem

    Abstract: Neural networks have revolutionized language modeling and excelled in various downstream tasks. However, the extent to which these models achieve compositional generalization comparable to human cognitive abilities remains a topic of debate. While existing approaches in the field have mainly focused on novel architectures and alternative learning paradigms, we introduce a pioneering method harness…

    Submitted 18 October, 2023; originally announced October 2023.

    Comments: Accepted to Findings of EMNLP 2023

  8. Hyperspectral Image Denoising via Self-Modulating Convolutional Neural Networks

    Authors: Orhan Torun, Seniha Esen Yuksel, Erkut Erdem, Nevrez Imamoglu, Aykut Erdem

    Abstract: Compared to natural images, hyperspectral images (HSIs) consist of a large number of bands, with each band capturing different spectral information from a certain wavelength, even some beyond the visible spectrum. These characteristics of HSIs make them highly effective for remote sensing applications. That said, the existing hyperspectral imaging devices introduce severe degradation in HSIs. Henc…

    Submitted 15 September, 2023; originally announced September 2023.

    Journal ref: Signal Processing, Volume 214, January 2024, 109248

  9. arXiv:2308.13004  [pdf, other]

    cs.CV cs.AI cs.MM

    Spherical Vision Transformer for 360-degree Video Saliency Prediction

    Authors: Mert Cokelek, Nevrez Imamoglu, Cagri Ozcinar, Erkut Erdem, Aykut Erdem

    Abstract: The growing interest in omnidirectional videos (ODVs) that capture the full field-of-view (FOV) has gained 360-degree saliency prediction importance in computer vision. However, predicting where humans look in 360-degree scenes presents unique challenges, including spherical distortion, high resolution, and limited labelled data. We propose a novel vision-transformer-based model for omnidirectiona…

    Submitted 24 August, 2023; originally announced August 2023.

    Comments: 12 pages, 4 figures, accepted to BMVC 2023

  10. arXiv:2307.08397  [pdf, other]

    cs.CV

    CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing

    Authors: Ahmet Canberk Baykal, Abdul Basit Anees, Duygu Ceylan, Erkut Erdem, Aykut Erdem, Deniz Yuret

    Abstract: Researchers have recently begun exploring the use of StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing approaches for editing images using language either resort to instance-level latent code optimization or map predefined text prompts to some editing directions in the latent space. H…

    Submitted 18 July, 2023; v1 submitted 17 July, 2023; originally announced July 2023.

    Comments: Accepted for publication in ACM Transactions on Graphics

  11. HyperE2VID: Improving Event-Based Video Reconstruction via Hypernetworks

    Authors: Burak Ercan, Onur Eker, Canberk Saglam, Aykut Erdem, Erkut Erdem

    Abstract: Event-based cameras are becoming increasingly popular for their ability to capture high-speed motion with low latency and high dynamic range. However, generating videos from events remains challenging due to the highly sparse and varying nature of event data. To address this, in this study, we propose HyperE2VID, a dynamic neural network architecture for event-based video reconstruction. Our appro…

    Submitted 20 February, 2024; v1 submitted 10 May, 2023; originally announced May 2023.

    Comments: 20 pages, 11 figures. Accepted by IEEE Transactions on Image Processing. The project page can be found at https://ercanburak.github.io/HyperE2VID.html

    Journal ref: IEEE Trans. Image Process., 33 (2024), 1826-1837

  12. EVREAL: Towards a Comprehensive Benchmark and Analysis Suite for Event-based Video Reconstruction

    Authors: Burak Ercan, Onur Eker, Aykut Erdem, Erkut Erdem

    Abstract: Event cameras are a new type of vision sensor that incorporates asynchronous and independent pixels, offering advantages over traditional frame-based cameras such as high dynamic range and minimal motion blur. However, their output is not easily understandable by humans, making the reconstruction of intensity images from event streams a fundamental task in event-based vision. While recent deep lea…

    Submitted 5 April, 2024; v1 submitted 30 April, 2023; originally announced May 2023.

    Comments: 19 pages, 9 figures. Has been accepted for publication at the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, 2023. The project page can be found at https://ercanburak.github.io/evreal.html

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3942-3951. 2023

  13. arXiv:2304.06020  [pdf, other]

    cs.CV

    VidStyleODE: Disentangled Video Editing via StyleGAN and NeuralODEs

    Authors: Moayed Haji Ali, Andrew Bond, Tolga Birdal, Duygu Ceylan, Levent Karacan, Erkut Erdem, Aykut Erdem

    Abstract: We propose $\textbf{VidStyleODE}$, a spatiotemporally continuous disentangled $\textbf{Vid}$eo representation based upon $\textbf{Style}$GAN and Neural-$\textbf{ODE}$s. Effective traversal of the latent space learned by Generative Adversarial Networks (GANs) has been the basis for recent breakthroughs in image editing. However, the applicability of such advancements to the video domain has been hi…

    Submitted 12 April, 2023; originally announced April 2023.

    Journal ref: ICCV 2023

  14. arXiv:2304.03246  [pdf, other]

    cs.CV

    Inst-Inpaint: Instructing to Remove Objects with Diffusion Models

    Authors: Ahmet Burak Yildirim, Vedat Baday, Erkut Erdem, Aykut Erdem, Aysegul Dundar

    Abstract: Image inpainting task refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels that are wished to be erased are defined with binary masks. From the application point of view, a user needs to generate the masks for the objects they would like to remove which can be time-consuming and prone to errors. In this work, we ar…

    Submitted 9 August, 2023; v1 submitted 6 April, 2023; originally announced April 2023.

  15. arXiv:2303.06907  [pdf, other]

    cs.CV eess.IV

    ST360IQ: No-Reference Omnidirectional Image Quality Assessment with Spherical Vision Transformers

    Authors: Nafiseh Jabbari Tofighi, Mohamed Hedi Elfkir, Nevrez Imamoglu, Cagri Ozcinar, Erkut Erdem, Aykut Erdem

    Abstract: Omnidirectional images, aka 360 images, can deliver immersive and interactive visual experiences. As their popularity has increased dramatically in recent years, evaluating the quality of 360 images has become a problem of interest since it provides insights for capturing, transmitting, and consuming this new media. However, directly adapting quality assessment methods proposed for standard natura…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  16. arXiv:2211.04576  [pdf, other]

    cs.CL cs.AI

    Detecting Euphemisms with Literal Descriptions and Visual Imagery

    Authors: İlker Kesen, Aykut Erdem, Erkut Erdem, Iacer Calixto

    Abstract: This paper describes our two-stage system for the Euphemism Detection shared task hosted by the 3rd Workshop on Figurative Language Processing in conjunction with EMNLP 2022. Euphemisms tone down expressions about sensitive or unpleasant issues like addiction and death. The ambiguous nature of euphemistic words or expressions makes it challenging to detect their actual meaning within a context. In…

    Submitted 8 November, 2022; originally announced November 2022.

    Comments: 7 pages, 1 table, 1 figure. Accepted to the 3rd Workshop on Figurative Language Processing at EMNLP 2022. https://github.com/ilkerkesen/euphemism

  17. arXiv:2211.02980  [pdf, other]

    cs.CV

    Disentangling Content and Motion for Text-Based Neural Video Manipulation

    Authors: Levent Karacan, Tolga Kerimoğlu, İsmail İnan, Tolga Birdal, Erkut Erdem, Aykut Erdem

    Abstract: Giving machines the ability to imagine possible new objects or scenes from linguistic descriptions and produce their realistic renderings is arguably one of the most challenging problems in computer vision. Recent advances in deep generative models have led to new approaches that give promising results towards this goal. In this paper, we introduce a new method called DiCoMoGAN for manipulating vi…

    Submitted 5 November, 2022; originally announced November 2022.

  18. arXiv:2209.08564  [pdf, other]

    cs.CV cs.LG eess.IV eess.SP

    Perception-Distortion Trade-off in the SR Space Spanned by Flow Models

    Authors: Cansu Korkmaz, A. Murat Tekalp, Zafer Dogan, Erkut Erdem, Aykut Erdem

    Abstract: Flow-based generative super-resolution (SR) models learn to produce a diverse set of feasible SR solutions, called the SR space. Diversity of SR solutions increases with the temperature ($τ$) of latent variables, which introduces random variations of texture among sample solutions, resulting in visual artifacts and low fidelity. In this paper, we present a simple but effective image ensembling/fus…

    Submitted 18 September, 2022; originally announced September 2022.

    Comments: 5 pages, 4 figures, accepted for publication in IEEE ICIP 2022 Conference

  19. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza , et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  20. arXiv:2108.02760  [pdf, other]

    cs.CV

    SLAMP: Stochastic Latent Appearance and Motion Prediction

    Authors: Adil Kaan Akan, Erkut Erdem, Aykut Erdem, Fatma Güney

    Abstract: Motion is an important cue for video prediction and often utilized by separating video content into static and dynamic components. Most of the previous work utilizing motion is deterministic but there are stochastic methods that can model the inherent uncertainty of the future. Existing stochastic models either do not reason about motion explicitly or make limiting assumptions about the static par…

    Submitted 5 August, 2021; originally announced August 2021.

    Comments: ICCV 2021

  21. arXiv:2102.07682  [pdf, other]

    cs.CV

    A Gated Fusion Network for Dynamic Saliency Prediction

    Authors: Aysun Kocak, Erkut Erdem, Aykut Erdem

    Abstract: Predicting saliency in videos is a challenging problem due to complex modeling of interactions between spatial and temporal information, especially when ever-changing, dynamic nature of videos is considered. Recently, researchers have proposed large-scale datasets and models that take advantage of deep learning as a way to understand what's important for video saliency. These approaches, however,…

    Submitted 15 February, 2021; originally announced February 2021.

    Comments: Project page: https://hucvl.github.io/GFSalNet/

  22. Object and Relation Centric Representations for Push Effect Prediction

    Authors: Ahmet E. Tekden, Aykut Erdem, Erkut Erdem, Tamim Asfour, Emre Ugur

    Abstract: Pushing is an essential non-prehensile manipulation skill used for tasks ranging from pre-grasp manipulation to scene rearrangement, reasoning about object relations in the scene, and thus pushing actions have been widely studied in robotics. The effective use of pushing actions often requires an understanding of the dynamics of the manipulated objects and adaptation to the discrepancies between p…

    Submitted 22 February, 2023; v1 submitted 3 February, 2021; originally announced February 2021.

    Comments: Project Page: https://fzaero.github.io/push_learning/

  23. arXiv:2101.10044  [pdf, other]

    cs.CL cs.CV

    Cross-lingual Visual Pre-training for Multimodal Machine Translation

    Authors: Ozan Caglayan, Menekse Kuyu, Mustafa Sercan Amac, Pranava Madhyastha, Erkut Erdem, Aykut Erdem, Lucia Specia

    Abstract: Pre-trained language models have been shown to improve performance in many natural language tasks substantially. Although the early focus of such models was single language pre-training, recent advances have resulted in cross-lingual and visual pre-training methods. In this paper, we combine these two approaches to learn visually-grounded cross-lingual representations. Specifically, we extend the…

    Submitted 20 April, 2021; v1 submitted 25 January, 2021; originally announced January 2021.

    Comments: Accepted to EACL 2021 (Camera-ready version)

  24. arXiv:2012.07098  [pdf, other]

    cs.CV

    MSVD-Turkish: A Comprehensive Multimodal Dataset for Integrated Vision and Language Research in Turkish

    Authors: Begum Citamak, Ozan Caglayan, Menekse Kuyu, Erkut Erdem, Aykut Erdem, Pranava Madhyastha, Lucia Specia

    Abstract: Automatic generation of video descriptions in natural language, also called video captioning, aims to understand the visual content of the video and produce a natural language sentence depicting the objects and actions in the scene. This challenging integrated vision and language problem, however, has been predominantly addressed for English. The lack of data and the linguistic properties of other…

    Submitted 13 December, 2020; originally announced December 2020.

  25. arXiv:2012.04293  [pdf, other]

    cs.AI cs.CL cs.CV

    CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions

    Authors: Tayfun Ates, M. Samil Atesoglu, Cagatay Yigit, Ilker Kesen, Mert Kobas, Erkut Erdem, Aykut Erdem, Tilbe Goksun, Deniz Yuret

    Abstract: Humans are able to perceive, understand and reason about causal events. Developing models with similar physical and causal understanding capabilities is a long-standing goal of artificial intelligence. As a step towards this direction, we introduce CRAFT, a new video question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 58K video and q…

    Submitted 1 March, 2022; v1 submitted 8 December, 2020; originally announced December 2020.

    Comments: Accepted to Findings of ACL 2022

  26. Burst Photography for Learning to Enhance Extremely Dark Images

    Authors: Ahmet Serdar Karadeniz, Erkut Erdem, Aykut Erdem

    Abstract: Capturing images under extremely low-light conditions poses significant challenges for the standard camera pipeline. Images become too dark and too noisy, which makes traditional enhancement techniques almost impossible to apply. Recently, learning-based approaches have shown very promising results for this task since they have substantially more expressive capabilities to allow for improved quali…

    Submitted 19 November, 2021; v1 submitted 17 June, 2020; originally announced June 2020.

    Comments: Published in IEEE Transactions on Image Processing

  27. arXiv:2003.12739  [pdf, other]

    cs.CV cs.CL cs.LG

    Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters

    Authors: İlker Kesen, Ozan Arkan Can, Erkut Erdem, Aykut Erdem, Deniz Yuret

    Abstract: How to best integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct visual attention over high-level visual features, may not be optimal. We hypothesize that the use of language to also condition the bottom-up processing from p…

    Submitted 23 June, 2022; v1 submitted 28 March, 2020; originally announced March 2020.

    Comments: 13 pages, 6 figures, 6 tables. Appeared in MULA Workshop at CVPR 2022

    Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022, pp. 4610-4620

  28. arXiv:2003.07823   

    cs.CV

    Burst Denoising of Dark Images

    Authors: Ahmet Serdar Karadeniz, Erkut Erdem, Aykut Erdem

    Abstract: Capturing images under extremely low-light conditions poses significant challenges for the standard camera pipeline. Images become too dark and too noisy, which makes traditional image enhancement techniques almost impossible to apply. Very recently, researchers have shown promising results using learning based approaches. Motivated by these ideas, in this paper, we propose a deep learning framewo…

    Submitted 18 June, 2020; v1 submitted 17 March, 2020; originally announced March 2020.

    Comments: This paper has been withdrawn by the authors to be replaced by a new version available at arXiv:2006.09845

  29. arXiv:1909.11504  [pdf, other]

    eess.IV cs.CV

    mustGAN: Multi-Stream Generative Adversarial Networks for MR Image Synthesis

    Authors: Mahmut Yurt, Salman Ul Hassan Dar, Aykut Erdem, Erkut Erdem, Tolga Çukur

    Abstract: Multi-contrast MRI protocols increase the level of morphological information available for diagnosis. Yet, the number and quality of contrasts is limited in practice by various factors including scan time and patient motion. Synthesis of missing or corrupted contrasts can alleviate this limitation to improve clinical utility. Common approaches for multi-contrast MRI involve either one-to-one and m…

    Submitted 25 September, 2019; originally announced September 2019.

  30. arXiv:1909.08859  [pdf, other]

    cs.CL cs.CV

    Procedural Reasoning Networks for Understanding Multimodal Procedures

    Authors: Mustafa Sercan Amac, Semih Yagcioglu, Aykut Erdem, Erkut Erdem

    Abstract: This paper addresses the problem of comprehending procedural commonsense knowledge. This is a challenging task as it requires identifying key entities, keeping track of their state changes, and understanding temporal and causal relations. Contrary to most of the previous work, in this study, we do not rely on strong inductive bias and explore the question of how multimodality can be exploited to p…

    Submitted 19 September, 2019; originally announced September 2019.

    Comments: Accepted to CoNLL 2019. The project website with code and demo is available at https://hucvl.github.io/prn/

  31. arXiv:1909.03785  [pdf, other]

    cs.RO

    Belief Regulated Dual Propagation Nets for Learning Action Effects on Groups of Articulated Objects

    Authors: Ahmet E. Tekden, Aykut Erdem, Erkut Erdem, Mert Imre, M. Yunus Seker, Emre Ugur

    Abstract: In recent years, graph neural networks have been successfully applied for learning the dynamics of complex and partially observable physical systems. However, their use in the robotics domain is, to date, still limited. In this paper, we introduce Belief Regulated Dual Propagation Networks (BRDPN), a general-purpose learnable physics engine, which enables a robot to predict the effects of its acti…

    Submitted 16 March, 2020; v1 submitted 9 September, 2019; originally announced September 2019.

    Comments: Accepted to ICRA 2020. Project page: https://fzaero.github.io/BRDPN/ , Video: https://youtu.be/uWPr7IFT_9k

  32. arXiv:1809.00812  [pdf, other]

    cs.CL cs.CV

    RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes

    Authors: Semih Yagcioglu, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis

    Abstract: Understanding and reasoning about cooking recipes is a fruitful research direction towards enabling machines to interpret procedural text. In this work, we introduce RecipeQA, a dataset for multimodal comprehension of cooking recipes. It comprises of approximately 20K instructional recipes with multiple modalities such as titles, descriptions and aligned set of images. With over 36K automatically…

    Submitted 4 September, 2018; originally announced September 2018.

    Comments: EMNLP 2018

  33. arXiv:1808.07413  [pdf, other]

    cs.CV

    Manipulating Attributes of Natural Scenes via Hallucination

    Authors: Levent Karacan, Zeynep Akata, Aykut Erdem, Erkut Erdem

    Abstract: In this study, we explore building a two-stage framework for enabling users to directly manipulate high-level attributes of a natural scene. The key to our approach is a deep generative network which can hallucinate images of a scene as if they were taken at a different season (e.g. during winter), weather condition (e.g. in a cloudy day) or time of the day (e.g. at sunset). Once the scene is hall…

    Submitted 9 October, 2019; v1 submitted 22 August, 2018; originally announced August 2018.

    Comments: Accepted for publication in ACM Transactions on Graphics

  34. arXiv:1808.04000  [pdf, other]

    cs.CV

    Language Guided Fashion Image Manipulation with Feature-wise Transformations

    Authors: Mehmet Günel, Erkut Erdem, Aykut Erdem

    Abstract: Developing techniques for editing an outfit image through natural sentences and accordingly generating new outfits has promising applications for art, fashion and design. However, it is considered as a certainly challenging task since image manipulation should be carried out only on the relevant parts of the image while keeping the remaining sections untouched. Moreover, this manipulation process…

    Submitted 12 August, 2018; originally announced August 2018.

    Comments: Accepted to ECCV 2018, First Workshop on Computer Vision For Fashion, Art and Design (extended version)

  35. arXiv:1802.01221  [pdf]

    cs.CV

    Image Synthesis in Multi-Contrast MRI with Conditional Generative Adversarial Networks

    Authors: Salman Ul Hassan Dar, Mahmut Yurt, Levent Karacan, Aykut Erdem, Erkut Erdem, Tolga Çukur

    Abstract: Acquiring images of the same anatomy with multiple different contrasts increases the diversity of diagnostic information available in an MR exam. Yet, scan time limitations may prohibit acquisition of certain contrasts, and images for some contrast may be corrupted by noise and artifacts. In such cases, the ability to synthesize unacquired or corrupted contrasts from remaining contrasts can improv…

    Submitted 4 February, 2018; originally announced February 2018.

  36. arXiv:1612.07600  [pdf, other]

    cs.CL cs.CV

    Re-evaluating Automatic Metrics for Image Captioning

    Authors: Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, Erkut Erdem

    Abstract: The task of generating natural language descriptions from images has received a lot of attention in recent years. Consequently, it is becoming increasingly important to evaluate such image captioning approaches in an automatic manner. In this paper, we provide an in-depth evaluation of the existing image captioning metrics through a series of carefully designed experiments. Moreover, we explore th…

    Submitted 22 December, 2016; originally announced December 2016.

  37. arXiv:1612.00215  [pdf, other]

    cs.CV

    Learning to Generate Images of Outdoor Scenes from Attributes and Semantic Layouts

    Authors: Levent Karacan, Zeynep Akata, Aykut Erdem, Erkut Erdem

    Abstract: Automatic image synthesis research has been rapidly growing with deep networks getting more and more expressive. In the last couple of years, we have observed images of digits, indoor scenes, birds, chairs, etc. being automatically generated. The expressive power of image generators have also been enhanced by introducing several forms of conditioning variables such as object names, sentences, boun…

    Submitted 1 December, 2016; originally announced December 2016.

  38. arXiv:1607.04730  [pdf, other]

    cs.CV

    Spatio-Temporal Saliency Networks for Dynamic Saliency Prediction

    Authors: Cagdas Bak, Aysun Kocak, Erkut Erdem, Aykut Erdem

    Abstract: Computational saliency models for still images have gained significant popularity in recent years. Saliency prediction from videos, on the other hand, has received relatively little interest from the community. Motivated by this, in this work, we study the use of deep learning for dynamic saliency prediction and propose the so-called spatio-temporal saliency networks. The key to our models is the…

    Submitted 15 November, 2017; v1 submitted 16 July, 2016; originally announced July 2016.

  39. arXiv:1601.03896  [pdf, ps, other]

    cs.CL cs.CV

    Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures

    Authors: Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, Barbara Plank

    Abstract: Automatic description generation from natural images is a challenging problem that has recently received a large amount of interest from the computer vision and natural language processing communities. In this survey, we classify the existing approaches based on how they conceptualize this problem, viz., models that cast description as either generation problem or as a retrieval problem over a vis…

    Submitted 24 April, 2017; v1 submitted 15 January, 2016; originally announced January 2016.

    Comments: Journal of Artificial Intelligence Research 55, 409-442, 2016

  40. arXiv:1307.5693  [pdf, other]

    cs.CV

    Visual saliency estimation by integrating features using multiple kernel learning

    Authors: Yasin Kavak, Erkut Erdem, Aykut Erdem

    Abstract: In the last few decades, significant achievements have been attained in predicting where humans look at images through different computational models. However, how to determine contributions of different visual features to overall saliency still remains an open problem. To overcome this issue, a recent class of models formulates saliency estimation as a supervised learning problem and accordingly…

    Submitted 22 July, 2013; originally announced July 2013.

    Report number: ISACS/2013/03

  41. arXiv:1104.2751  [pdf, other]

    cs.CV

    Disconnected Skeleton: Shape at its Absolute Scale

    Authors: C. Aslan, A. Erdem, E. Erdem, S. Tari

    Abstract: We present a new skeletal representation along with a matching framework to address the deformable shape recognition problem. The disconnectedness arises as a result of excessive regularization that we use to describe a shape at an attainably coarse scale. Our motivation is to rely on the stable properties of the shape instead of inaccurately measured secondary details. The new representation does…

    Submitted 14 April, 2011; originally announced April 2011.

    Comments: The work excluding §V and §VI has first appeared in 2005 ICCV: Aslan, C., Tari, S.: An Axis-Based Representation for Recognition. In ICCV (2005) 1339-1346; Aslan, C.: Disconnected Skeletons for Shape Recognition. Masters thesis, Department of Computer Engineering, Middle East Technical University, May 2005

    Journal ref: T-PAMI vol. 30 no. 12, pp. 2188-2203, 2008