
Showing 1–46 of 46 results for author: Engel, J

  1. arXiv:2406.09905  [pdf, other]

    cs.CV cs.GR

    Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild

    Authors: Lingni Ma, Yuting Ye, Fangzhou Hong, Vladimir Guzov, Yifeng Jiang, Rowan Postyeni, Luis Pesqueira, Alexander Gamino, Vijay Baiyya, Hyo Jin Kim, Kevin Bailey, David Soriano Fosas, C. Karen Liu, Ziwei Liu, Jakob Engel, Renzo De Nardi, Richard Newcombe

    Abstract: We introduce Nymeria - a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices. The dataset comes with a) full-body 3D motion ground truth; b) egocentric multimodal recordings from Project Aria devices with RGB, grayscale, eye-tracking cameras, IMUs, magnetometer, barometer, and microphones; and c) an additional "observer" dev…

    Submitted 14 June, 2024; originally announced June 2024.

  2. arXiv:2406.09598  [pdf, other]

    cs.CV

    Introducing HOT3D: An Egocentric Dataset for 3D Hand and Object Tracking

    Authors: Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Fan Zhang, Jade Fountain, Edward Miller, Selen Basol, Richard Newcombe, Robert Wang, Jakob Julian Engel, Tomas Hodan

    Abstract: We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (more than 3.7M images) of multi-view RGB/monochrome image streams showing 19 subjects interacting with 33 diverse rigid objects, multi-modal signals such as eye gaze or scene point clouds, as well as comprehensive ground truth annotations including 3D poses of object…

    Submitted 13 June, 2024; originally announced June 2024.

  3. arXiv:2403.13064  [pdf, other]

    cs.CV

    SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model

    Authors: Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Engel, Edward Miller, Richard Newcombe, Vasileios Balntas

    Abstract: We introduce SceneScript, a method that directly produces full scene models as a sequence of structured language commands using an autoregressive, token-based approach. Our proposed scene representation is inspired by recent successes in transformers & LLMs, and departs from more traditional methods which commonly describe scenes as meshes, voxel grids, point clouds or radiance fields. Our method…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: see project page, https://projectaria.com/scenescript

  4. arXiv:2402.13349  [pdf, other]

    cs.CV cs.AI cs.HC

    Aria Everyday Activities Dataset

    Authors: Zhaoyang Lv, Nicholas Charron, Pierre Moulon, Alexander Gamino, Cheng Peng, Chris Sweeney, Edward Miller, Huixuan Tang, Jeff Meissner, Jing Dong, Kiran Somasundaram, Luis Pesqueira, Mark Schwesinger, Omkar Parkhi, Qiao Gu, Renzo De Nardi, Shangyi Cheng, Steve Saarinen, Vijay Baiyya, Yuyang Zou, Richard Newcombe, Jakob Julian Engel, Xiaqing Pan, Carl Ren

    Abstract: We present Aria Everyday Activities (AEA) Dataset, an egocentric multimodal open dataset recorded using Project Aria glasses. AEA contains 143 daily activity sequences recorded by multiple wearers in five geographically diverse indoor locations. Each of the recordings contains multimodal sensor data recorded through the Project Aria glasses. In addition, AEA provides machine perception data includi…

    Submitted 21 February, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: Dataset website: https://www.projectaria.com/datasets/aea/

  5. arXiv:2311.18259  [pdf, other]

    cs.CV cs.AI

    Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

    Authors: Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, et al. (76 additional authors not shown)

    Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from…

    Submitted 29 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

    Comments: updated baseline results and dataset statistics to match the released v2 data; added table to appendix comparing stats of Ego-Exo4D alongside other datasets

  6. arXiv:2309.08803  [pdf, other]

    cs.RO eess.SP

    Robust Indoor Localization with Ranging-IMU Fusion

    Authors: Fan Jiang, David Caruso, Ashutosh Dhekne, Qi Qu, Jakob Julian Engel, Jing Dong

    Abstract: Indoor wireless ranging localization is a promising approach for low-power and high-accuracy localization of wearable devices. A primary challenge in this domain stems from non-line of sight propagation of radio waves. This study tackles a fundamental issue in wireless ranging: the unpredictability of real-time multipath determination, especially in challenging conditions such as when there is no…

    Submitted 15 September, 2023; originally announced September 2023.

  7. arXiv:2308.13561  [pdf, other]

    cs.HC cs.CV

    Project Aria: A New Tool for Egocentric Multi-Modal AI Research

    Authors: Jakob Engel, Kiran Somasundaram, Michael Goesele, Albert Sun, Alexander Gamino, Andrew Turner, Arjang Talattof, Arnie Yuan, Bilal Souti, Brighid Meredith, Cheng Peng, Chris Sweeney, Cole Wilson, Dan Barnes, Daniel DeTone, David Caruso, Derek Valleroy, Dinesh Ginjupalli, Duncan Frost, Edward Miller, Elias Mueggler, Evgeniy Oleinik, Fan Zhang, Guruprasad Somasundaram, Gustavo Solaira, et al. (49 additional authors not shown)

    Abstract: Egocentric, multi-modal data as available on future augmented reality (AR) devices provides unique challenges and opportunities for machine perception. These future devices will need to be all-day wearable in a socially acceptable form-factor to support always available, context-aware and personalized AI applications. Our team at Meta Reality Labs Research built the Aria device, an egocentric, mul…

    Submitted 1 October, 2023; v1 submitted 24 August, 2023; originally announced August 2023.

  8. arXiv:2302.03917  [pdf, other]

    cs.SD cs.LG eess.AS

    Noise2Music: Text-conditioned Music Generation with Diffusion Models

    Authors: Qingqing Huang, Daniel S. Park, Tao Wang, Timo I. Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, Jesse Engel, Quoc V. Le, William Chan, Zhifeng Chen, Wei Han

    Abstract: We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and…

    Submitted 6 March, 2023; v1 submitted 8 February, 2023; originally announced February 2023.

    Comments: 15 pages

  9. arXiv:2301.12662  [pdf, other]

    cs.SD cs.AI cs.LG cs.MM eess.AS

    SingSong: Generating musical accompaniments from singing

    Authors: Chris Donahue, Antoine Caillon, Adam Roberts, Ethan Manilow, Philippe Esling, Andrea Agostinelli, Mauro Verzetti, Ian Simon, Olivier Pietquin, Neil Zeghidour, Jesse Engel

    Abstract: We present SingSong, a system that generates instrumental music to accompany input vocals, potentially offering musicians and non-musicians alike an intuitive new way to create music featuring their own voice. To accomplish this, we build on recent developments in musical source separation and audio generation. Specifically, we apply a state-of-the-art source separation algorithm to a large corpus…

    Submitted 29 January, 2023; originally announced January 2023.

  10. arXiv:2301.11325  [pdf, other]

    cs.SD cs.LG eess.AS

    MusicLM: Generating Music From Text

    Authors: Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank

    Abstract: We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous s…

    Submitted 26 January, 2023; originally announced January 2023.

    Comments: Supplementary material at https://google-research.github.io/seanet/musiclm/examples and https://kaggle.com/datasets/googleai/musiccaps

  11. arXiv:2212.08038  [pdf, ps, other]

    cs.CY

    Redefining Relationships in Music

    Authors: Christian Detweiler, Beth Coleman, Fernando Diaz, Lieke Dom, Chris Donahue, Jesse Engel, Cheng-Zhi Anna Huang, Larry James, Ethan Manilow, Amanda McCroskery, Kyle Pedersen, Pamela Peter-Agbia, Negar Rostamzadeh, Robert Thomas, Marco Zamarato, Ben Zevenbergen

    Abstract: AI tools increasingly shape how we discover, make and experience music. While these tools can have the potential to empower creativity, they may fundamentally redefine relationships between stakeholders, to the benefit of some and the detriment of others. In this position paper, we argue that these tools will fundamentally reshape our music culture, with profound effects (for better and for worse)…

    Submitted 16 December, 2022; v1 submitted 13 December, 2022; originally announced December 2022.

    Comments: Presented at Cultures in AI/AI in Culture workshop at NeurIPS 2022

  12. arXiv:2209.14458  [pdf, other]

    cs.SD cs.IR cs.LG eess.AS

    The Chamber Ensemble Generator: Limitless High-Quality MIR Data via Generative Modeling

    Authors: Yusong Wu, Josh Gardner, Ethan Manilow, Ian Simon, Curtis Hawthorne, Jesse Engel

    Abstract: Data is the lifeblood of modern machine learning systems, including for those in Music Information Retrieval (MIR). However, MIR has long been mired by small datasets and unreliable labels. In this work, we propose to break this bottleneck using generative modeling. By pipelining a generative model of notes (Coconet trained on Bach Chorales) with a structured synthesis model of chamber ensembles (…

    Submitted 28 September, 2022; originally announced September 2022.

  13. arXiv:2206.05408  [pdf, other]

    cs.SD cs.LG eess.AS

    Multi-instrument Music Synthesis with Spectrogram Diffusion

    Authors: Curtis Hawthorne, Ian Simon, Adam Roberts, Neil Zeghidour, Josh Gardner, Ethan Manilow, Jesse Engel

    Abstract: An ideal music synthesizer should be both interactive and expressive, generating high-fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural synthesizers have exhibited a tradeoff between domain-specific models that offer detailed control of only specific instruments, or raw waveform models that can train on any music but with minimal control and slow generat…

    Submitted 12 December, 2022; v1 submitted 10 June, 2022; originally announced June 2022.

  14. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  15. arXiv:2203.15182  [pdf, other]

    cs.CV cs.RO

    Long-term Visual Map Sparsification with Heterogeneous GNN

    Authors: Ming-Fang Chang, Yipu Zhao, Rajvi Shah, Jakob J. Engel, Michael Kaess, Simon Lucey

    Abstract: We address the problem of map sparsification for long-term visual localization. For map sparsification, a commonly employed assumption is that the pre-build map and the later captured localization query are consistent. However, this assumption can be easily violated in the dynamic world. Additionally, the map size grows as new data accumulate through time, causing large data overhead in the long t…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Accepted by CVPR 2022

  16. arXiv:2203.15140  [pdf, other]

    cs.SD eess.AS

    Improving Source Separation by Explicitly Modeling Dependencies Between Sources

    Authors: Ethan Manilow, Curtis Hawthorne, Cheng-Zhi Anna Huang, Bryan Pardo, Jesse Engel

    Abstract: We propose a new method for training a supervised source separation system that aims to learn the interdependent relationships between all combinations of sources in a mixture. Rather than independently estimating each source from a mix, we reframe the source separation problem as an Orderless Neural Autoregressive Density Estimator (NADE), and estimate each source from both the mix and a random s…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: To appear at ICASSP 2022

  17. arXiv:2203.03022  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS stat.ML

    HEAR: Holistic Evaluation of Audio Representations

    Authors: Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk

    Abstract: What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR evaluates audio representations using a benchmark suite across a variety of domains, in…

    Submitted 29 May, 2022; v1 submitted 6 March, 2022; originally announced March 2022.

    Comments: to appear in Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track

  18. arXiv:2202.07765  [pdf, other]

    cs.LG cs.AI cs.CV cs.SD eess.AS

    General-purpose, long-context autoregressive modeling with Perceiver AR

    Authors: Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, João Carreira, Jesse Engel

    Abstract: Real-world data is high-dimensional: a book, image, or musical performance can easily contain hundreds of thousands of elements even after compression. However, the most commonly used autoregressive models, Transformers, are prohibitively expensive to scale to the number of inputs and layers needed to capture this long-range structure. We develop Perceiver AR, an autoregressive, modality-agnostic…

    Submitted 14 June, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: ICML 2022

  19. arXiv:2112.09312  [pdf, other]

    cs.SD cs.LG eess.AS

    MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling

    Authors: Yusong Wu, Ethan Manilow, Yi Deng, Rigel Swavely, Kyle Kastner, Tim Cooijmans, Aaron Courville, Cheng-Zhi Anna Huang, Jesse Engel

    Abstract: Musical expression requires control of both what notes are played, and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but have few mechanisms for control. In this work, we introduce MIDI-DDSP a hierarchical model of musical instruments…

    Submitted 17 March, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

    Comments: Accepted by International Conference on Learning Representations (ICLR) 2022

  20. arXiv:2111.14951  [pdf, other]

    cs.HC cs.LG cs.SD eess.AS

    Expressive Communication: A Common Framework for Evaluating Developments in Generative Models and Steering Interfaces

    Authors: Ryan Louie, Jesse Engel, Anna Huang

    Abstract: There is an increasing interest from ML and HCI communities in empowering creators with better generative models and more intuitive interfaces with which to control them. In music, ML researchers have focused on training models capable of generating pieces with increasing long-range structure and musical coherence, while HCI researchers have separately focused on designing steering interfaces that…

    Submitted 29 November, 2021; originally announced November 2021.

    Comments: 15 pages, 6 figures, submitted to ACM Intelligent User Interfaces 2022 Conference

  21. arXiv:2111.03017  [pdf, other]

    cs.SD cs.LG eess.AS

    MT3: Multi-Task Multitrack Music Transcription

    Authors: Josh Gardner, Ian Simon, Ethan Manilow, Curtis Hawthorne, Jesse Engel

    Abstract: Automatic Music Transcription (AMT), inferring musical notes from raw audio, is a challenging task at the core of music understanding. Unlike Automatic Speech Recognition (ASR), which typically focuses on the words of a single speaker, AMT often requires transcribing multiple instruments simultaneously, all while preserving fine-scale pitch and timing information. Further, many AMT datasets are "l…

    Submitted 15 March, 2022; v1 submitted 4 November, 2021; originally announced November 2021.

    Comments: ICLR 2022 camera-ready version

  22. arXiv:2107.09142  [pdf, other]

    cs.SD cs.LG eess.AS

    Sequence-to-Sequence Piano Transcription with Transformers

    Authors: Curtis Hawthorne, Ian Simon, Rigel Swavely, Ethan Manilow, Jesse Engel

    Abstract: Automatic Music Transcription has seen significant progress in recent years by training custom deep neural networks on large datasets. However, these models have required extensive domain-specific design of network architectures, input/output representations, and complex decoding schemes. In this work, we show that equivalent performance can be achieved using a generic encoder-decoder Transformer…

    Submitted 19 July, 2021; originally announced July 2021.

  23. arXiv:2103.16091  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Symbolic Music Generation with Diffusion Models

    Authors: Gautam Mittal, Jesse Engel, Curtis Hawthorne, Ian Simon

    Abstract: Score-based generative models and diffusion probabilistic models have been successful at generating high-quality samples in continuous domains such as images and audio. However, due to their Langevin-inspired sampling mechanisms, their application to discrete and sequential data has been limited. In this work, we present a technique for training diffusion models on sequential data by parameterizin…

    Submitted 25 November, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

    Comments: ISMIR 2021

  24. arXiv:2103.06089  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    Variable-rate discrete representation learning

    Authors: Sander Dieleman, Charlie Nash, Jesse Engel, Karen Simonyan

    Abstract: Semantically meaningful information content in perceptual signals is usually unevenly distributed. In speech signals for example, there are often many silences, and the speed of pronunciation can vary considerably. In this work, we propose slow autoencoders (SlowAEs) for unsupervised learning of high-level variable-rate discrete representations of sequences, and apply them to speech. We show that…

    Submitted 10 March, 2021; originally announced March 2021.

    Comments: 26 pages, 15 figures, samples can be found at https://vdrl.github.io/

  25. arXiv:2007.01867  [pdf, other]

    cs.RO cs.CV cs.LG eess.SP

    TLIO: Tight Learned Inertial Odometry

    Authors: Wenxin Liu, David Caruso, Eddy Ilg, Jing Dong, Anastasios I. Mourikis, Kostas Daniilidis, Vijay Kumar, Jakob Engel

    Abstract: In this work we propose a tightly-coupled Extended Kalman Filter framework for IMU-only state estimation. Strap-down IMU measurements provide relative state estimates based on IMU kinematic motion model. However the integration of measurements is sensitive to sensor bias and noise, causing significant drift within seconds. Recent research by Yan et al. (RoNIN) and Chen et al. (IONet) showed the ca…

    Submitted 10 July, 2020; v1 submitted 5 July, 2020; originally announced July 2020.

    Comments: Correcting graph and bibliography. Adding journal reference information and DOI, in IEEE Robotics and Automation Letters

  26. arXiv:2004.00188  [pdf, other]

    cs.SD cs.LG

    Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset

    Authors: Lee Callender, Curtis Hawthorne, Jesse Engel

    Abstract: We introduce the Expanded Groove MIDI dataset (E-GMD), an automatic drum transcription (ADT) dataset that contains 444 hours of audio from 43 drum kits, making it an order of magnitude larger than similar datasets, and the first with human-performed velocity annotations. We use E-GMD to optimize classifiers for use in downstream generation by predicting expressive dynamics (velocity) and show with…

    Submitted 1 December, 2020; v1 submitted 31 March, 2020; originally announced April 2020.

    Comments: Examples available at https://goo.gl/magenta/e-gmd-examples

  27. arXiv:2001.05171  [pdf, other]

    cs.HC cs.CL cs.LG

    Teddy: A System for Interactive Review Analysis

    Authors: Xiong Zhang, Jonathan Engel, Sara Evensen, Yuliang Li, Çağatay Demiralp, Wang-Chiew Tan

    Abstract: Reviews are integral to e-commerce services and products. They contain a wealth of information about the opinions and experiences of users, which can help better understand consumer decisions and improve user experience with products and services. Today, data scientists analyze reviews by developing rules and models to extract, aggregate, and understand information embedded in the review text. How…

    Submitted 15 January, 2020; originally announced January 2020.

    Comments: CHI'20

  28. arXiv:2001.04643  [pdf, other]

    cs.LG cs.SD eess.AS eess.SP stat.ML

    DDSP: Differentiable Digital Signal Processing

    Authors: Jesse Engel, Lamtharn Hantrakul, Chenjie Gu, Adam Roberts

    Abstract: Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived. A third approach (vocoders/synthesizers) successfully incorporates strong domain knowledge of signal processing and perception, but has be…

    Submitted 14 January, 2020; originally announced January 2020.

  29. arXiv:1912.05537  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Encoding Musical Style with Transformer Autoencoders

    Authors: Kristy Choi, Curtis Hawthorne, Ian Simon, Monica Dinculescu, Jesse Engel

    Abstract: We consider the problem of learning high-level controls over the global structure of generated sequences, particularly in the context of symbolic music generation with complex language models. In this work, we present the Transformer autoencoder, which aggregates encodings of the input data across time to obtain a global representation of style from a given performance. We show it is possible to c…

    Submitted 30 June, 2020; v1 submitted 10 December, 2019; originally announced December 2019.

  30. arXiv:1906.05797  [pdf, other]

    cs.CV cs.GR eess.IV

    The Replica Dataset: A Digital Replica of Indoor Spaces

    Authors: Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, Anton Clarkson, Mingfei Yan, Brian Budge, Yajie Yan, Xiaqing Pan, June Yon, Yuyang Zou, Kimberly Leon, Nigel Carter, Jesus Briales, Tyler Gillingham, Elias Mueggler, Luis Pesqueira, Manolis Savva, Dhruv Batra, et al. (5 additional authors not shown)

    Abstract: We introduce Replica, a dataset of 18 highly photo-realistic 3D indoor scene reconstructions at room and building scale. Each scene consists of a dense mesh, high-resolution high-dynamic-range (HDR) textures, per-primitive semantic class and instance information, and planar mirror and glass reflectors. The goal of Replica is to enable machine learning (ML) research that relies on visually, geometr…

    Submitted 13 June, 2019; originally announced June 2019.

  31. arXiv:1905.06118  [pdf, other]

    cs.SD cs.LG cs.MM eess.AS stat.ML

    Learning to Groove with Inverse Sequence Transformations

    Authors: Jon Gillick, Adam Roberts, Jesse Engel, Douglas Eck, David Bamman

    Abstract: We explore models for translating abstract musical ideas (scores, rhythms) into expressive performances using Seq2Seq and recurrent Variational Information Bottleneck (VIB) models. Though Seq2Seq models usually require painstakingly aligned corpora, we show that it is possible to adapt an approach from the Generative Adversarial Network (GAN) literature (e.g. Pix2Pix (Isola et al., 2017) and Vid2V…

    Submitted 26 July, 2019; v1 submitted 14 May, 2019; originally announced May 2019.

    Comments: Blog post and links: https://g.co/magenta/groovae

    ACM Class: J.5; I.2

    Journal ref: Proceedings of the 36th International Conference on Machine Learning, PMLR 97:2269-2279, 2019

  32. arXiv:1902.08710  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    GANSynth: Adversarial Neural Audio Synthesis

    Authors: Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, Adam Roberts

    Abstract: Efficient audio synthesis is an inherently difficult machine learning task, as human perception is sensitive to both global structure and fine-scale waveform coherence. Autoregressive models, such as WaveNet, model local structure at the expense of global latent structure and slow iterative sampling, while Generative Adversarial Networks (GANs), have global latent conditioning and efficient parall…

    Submitted 14 April, 2019; v1 submitted 22 February, 2019; originally announced February 2019.

    Comments: Colab Notebook: http://goo.gl/magenta/gansynth-demo

  33. arXiv:1902.08261  [pdf, other]

    cs.LG cs.NE stat.ML

    Latent Translation: Crossing Modalities by Bridging Generative Models

    Authors: Yingtao Tian, Jesse Engel

    Abstract: End-to-end optimization has achieved state-of-the-art performance on many specific problems, but there is no straight-forward way to combine pretrained models for new problems. Here, we explore improving modularity by learning a post-hoc interface between two existing models to solve a new task. Specifically, we take inspiration from neural machine translation, and cast the challenging problem of…

    Submitted 21 February, 2019; originally announced February 2019.

  34. arXiv:1810.12247  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

    Authors: Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, Douglas Eck

    Abstract: Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of…

    Submitted 17 January, 2019; v1 submitted 29 October, 2018; originally announced October 2018.

    Comments: Examples available at https://goo.gl/magenta/maestro-examples

  35. arXiv:1806.00195  [pdf, other]

    stat.ML cs.LG cs.SD eess.AS

    Learning a Latent Space of Multitrack Measures

    Authors: Ian Simon, Adam Roberts, Colin Raffel, Jesse Engel, Curtis Hawthorne, Douglas Eck

    Abstract: Discovering and exploring the underlying structure of multi-instrumental music using learning-based approaches remains an open problem. We extend the recent MusicVAE model to represent multitrack polyphonic measures as vectors in a latent space. Our approach enables several useful operations such as generating plausible measures from scratch, interpolating between measures in a musically meaningfu…

    Submitted 1 June, 2018; originally announced June 2018.

  36. arXiv:1803.05428  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music

    Authors: Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, Douglas Eck

    Abstract: The Variational Autoencoder (VAE) has proven to be an effective model for producing semantically meaningful latent representations for natural data. However, it has thus far seen limited application to sequential data, and, as we demonstrate, existing recurrent VAE models have difficulty modeling sequences with long-term structure. To address this issue, we propose the use of a hierarchical decode…

    Submitted 11 November, 2019; v1 submitted 13 March, 2018; originally announced March 2018.

    Comments: ICML Camera Ready Version (w/ fixed typos)

    Journal ref: ICML 2018

  37. arXiv:1802.04877  [pdf, other]

    cs.LG cs.CV cs.HC

    Learning via social awareness: Improving a deep generative sketching model with facial feedback

    Authors: Natasha Jaques, Jennifer McCleary, Jesse Engel, David Ha, Fred Bertsch, Rosalind Picard, Douglas Eck

    Abstract: In the quest towards general artificial intelligence (AI), researchers have explored developing loss functions that act as intrinsic motivators in the absence of external rewards. This paper argues that such research has overlooked an important and useful intrinsic motivator: social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for faster lea…

    Submitted 27 August, 2018; v1 submitted 13 February, 2018; originally announced February 2018.

  38. arXiv:1711.05772  [pdf, other]

    cs.LG cs.NE stat.ML

    Latent Constraints: Learning to Generate Conditionally from Unconditional Generative Models

    Authors: Jesse Engel, Matthew Hoffman, Adam Roberts

    Abstract: Deep generative neural networks have proven effective at both conditional and unconditional modeling of complex data distributions. Conditional generation enables interactive control, but creating new controls often requires expensive retraining. In this paper, we develop a method to condition generation without retraining the model. By post-hoc learning latent constraints, value functions that id…

    Submitted 21 December, 2017; v1 submitted 15 November, 2017; originally announced November 2017.

  39. arXiv:1710.11153  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    Onsets and Frames: Dual-Objective Piano Transcription

    Authors: Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, Douglas Eck

    Abstract: We advance the state of the art in polyphonic piano music transcription by using a deep convolutional and recurrent neural network which is trained to jointly predict onsets and frames. Our model predicts pitch onset events and then uses those predictions to condition framewise pitch predictions. During inference, we restrict the predictions from the framewise detector by not allowing a new note t…

    Submitted 5 June, 2018; v1 submitted 30 October, 2017; originally announced October 2017.

    Comments: Examples available at https://goo.gl/magenta/onsets-frames-examples

  40. arXiv:1704.01279  [pdf, other]

    cs.LG cs.AI cs.SD

    Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

    Authors: Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, Mohammad Norouzi

    Abstract: Generative models in vision have seen rapid progress due to algorithmic improvements and the availability of high-quality image datasets. In this paper, we offer contributions in both these areas to enable similar progress in audio modeling. First, we detail a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio wavefor…

    Submitted 5 April, 2017; originally announced April 2017.

  41. arXiv:1701.06063  [pdf]

    cs.IT

    Opportunities for Analog Coding in Emerging Memory Systems

    Authors: Jesse H. Engel, S. Burc Eryilmaz, SangBum Kim, Matthew BrightSky, Chung Lam, Hsiang-Lan Lung, Bruno A. Olshausen, H. -S. Philip Wong

    Abstract: The exponential growth in data generation and large-scale data analysis creates an unprecedented need for inexpensive, low-latency, and high-density information storage. This need has motivated significant research into multi-level memory systems that can store multiple bits of information per device. Although both the memory state of these devices and much of the data they store are intrinsically…

    Submitted 21 January, 2017; originally announced January 2017.

  42. From Monocular SLAM to Autonomous Drone Exploration

    Authors: Lukas von Stumberg, Vladyslav Usenko, Jakob Engel, Jörg Stückler, Daniel Cremers

    Abstract: Micro aerial vehicles (MAVs) are strongly limited in their payload and power capacity. In order to implement autonomous navigation, algorithms are therefore desirable that use sensory equipment that is as small, low-weight, and low-power consuming as possible. In this paper, we propose a method for autonomous MAV navigation and exploration using a low-cost consumer-grade quadrocopter equipped with…

    Submitted 12 March, 2018; v1 submitted 25 September, 2016; originally announced September 2016.

    MSC Class: 68T40; ACM Class: I.2.9

  43. arXiv:1607.02565  [pdf, other]

    cs.CV

    Direct Sparse Odometry

    Authors: Jakob Engel, Vladlen Koltun, Daniel Cremers

    Abstract: We propose a novel direct sparse visual odometry formulation. It combines a fully direct probabilistic model (minimizing a photometric error) with consistent, joint optimization of all model parameters, including geometry -- represented as inverse depth in a reference frame -- and camera motion. This is achieved in real time by omitting the smoothness prior used in other direct methods and instead…

    Submitted 7 October, 2016; v1 submitted 9 July, 2016; originally announced July 2016.

    Comments: ** Corrected a bug which caused the real-time results for ORB-SLAM (dashed lines in Fig. 10 and 12) to be much worse than they should be ** Added references [12], [13],[19], and Fig. 11. ** Partly re-formulated and extended [5. Conclusion]. ** Fixed typos and minor re-formulations

  44. arXiv:1607.02555  [pdf, other]

    cs.CV

    A Photometrically Calibrated Benchmark For Monocular Visual Odometry

    Authors: Jakob Engel, Vladyslav Usenko, Daniel Cremers

    Abstract: We present a dataset for evaluating the tracking accuracy of monocular visual odometry and SLAM methods. It contains 50 real-world sequences comprising more than 100 minutes of video, recorded across dozens of different environments -- ranging from narrow indoor corridors to wide outdoor scenes. All sequences contain mostly exploring camera motion, starting and ending at the same position. This al…

    Submitted 8 October, 2016; v1 submitted 8 July, 2016; originally announced July 2016.

    Comments: * Corrected a bug in the evaluation setup, which caused the real-time results for ORB-SLAM (dashed lines in Figure 8) to be much worse than they should be. * https://vision.in.tum.de/data/datasets/mono-dataset

  45. arXiv:1603.09509  [pdf, other]

    cs.CL cs.LG cs.NE cs.SD

    Learning Multiscale Features Directly From Waveforms

    Authors: Zhenyao Zhu, Jesse H. Engel, Awni Hannun

    Abstract: Deep learning has dramatically improved the performance of speech recognition systems through learning hierarchies of features optimized for the task at hand. However, true end-to-end learning, where features are learned directly from waveforms, has only recently reached the performance of hand-tailored representations based on the Fourier transform. In this paper, we detail an approach to use con…

    Submitted 5 April, 2016; v1 submitted 31 March, 2016; originally announced March 2016.

    Comments: "fix typo in the title"

  46. arXiv:1512.02595  [pdf, other]

    cs.CL

    Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

    Authors: Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, et al. (9 additional authors not shown)

    Abstract: We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our app…

    Submitted 8 December, 2015; originally announced December 2015.