Sign Up

We have space for around 80 participants.  Please sign up by filling out the following Google form.  There is no registration fee.

Registration closed.  Please contact Dr. Yi-Hsuan Yang if you really want to come.

The workshop will be held in Room 106, Institute of Information Science (IIS), Academia Sinica.  (Note that it’s not in CITI).


9am / Opening

9:10am / Session 1


Andreas Rauber

Repeatability Challenges in MIR Research


Alexander Schindler

Music Video Analysis and Retrieval


Rudolf Mayer

Music and Lyrics – Multi-modal Analysis of Music

10:50am / Session 2

Session chair: Li Su


Jen-Tzung Chien

Bayesian Learning for Singing-Voice Separation


Yu-Ren Chien

Alignment of Lyrics With Accompanied Singing Audio Based on Acoustic-Phonetic Vowel Likelihood Modeling


Tak-Shing Thomas Chan

Guided Source Separation for Machine Listening and Brain-Computer Music Interfacing


Zhe-Cheng Fan

DNN for Singing Voice Separation and Audio Melody Extraction

2pm / Session 3


Markus Schedl

Music Retrieval and Recommendation via Social Media Mining


Peter Knees

Only Personalized Retrieval can be Semantic Retrieval


Eva Zangerle

The #nowplaying Dataset in the Context of Recommender Systems

4pm / Session 4

Session chair: Thomas Chan


Andreu Vall

Fusing Web and Audio Predictors to Localize the Origin of Music Pieces for Geospatial Retrieval


Tzu-Chun Yeh

AutoRhythm: A Music Game With Automatic Hit-Timing Generation And Percussion Identification


Chun-Ta Chen

Polyphonic Audio-To-Score Alignment Using Onset Detection And Constant Q Transform


Li Su

Music Technology of the Next Generation: Automatic Music Transcription and Beyond



Andreas Rauber

Associate Professor, Vienna University of Technology

Title: Repeatability Challenges in MIR Research

Abstract: Repeatability of experimental science is an essential ingredient to establish trust in eScience processes. Only if we are able to verify these and have verified components available can we integrate them to perform increasingly sophisticated, transparent research building on each other's developments. Such validation of repeatability and trusted reuse is particularly challenging in MIR research. Contrary to many other settings, raw data (i.e. music files subject to copyright restrictions) cannot be shared easily, rendering comparison of different approaches a difficult endeavor. Secondly, researchers rely increasingly on highly dynamic data sources such as social media and dynamic audio databases for their research, making precise identification of the data used in a particular study again a challenging task. Thirdly, signal processing algorithms employed in MIR, even when following standardized descriptions, may lead to differing results due to minute variations in the specific implementation of signal processing routines. All these characteristics render repeatability and validation of MIR research a highly desired but hard-to-attain goal. In particular, we will focus on (1) a detailed analysis of these challenges to repeatability; (2) approaches for benchmark data sharing via time-stamped and versioned data sources; (3) data identification via time-stamped queries as recommended by the Research Data Alliance (RDA); and (4) means to capture an experiment execution context and validation data to enable ex-post validation of experiments.

Alexander Schindler

PhD student, Vienna University of Technology

Title: Music Video Analysis and Retrieval

Abstract: In the second part of the last century, visual representation became a vital part of music. Album covers became a visual mnemonic for the music enclosed. Music videos distinctively influenced our pop culture and became a significant part of it. Music video production draws on a wide range of film-making techniques and roles, such as screenplays, directors, producers, and directors of photography. The applied effort creates enough information that many music genres can be guessed from the moving pictures alone. Over decades, stylistic elements related to fashion, scenery, dance moves, etc. have evolved into prototypical visual descriptions of genre-specific properties. Advances in visual computing provide the means to exploit this information to address open music information retrieval problems in a multimodal way. This workshop is intended to provide an overview of the application of image and video analysis to music information retrieval tasks. By introducing the technologies, toolsets and evaluation results, the following questions will be addressed: Can we extract information from music videos that is useful for music information retrieval? Which tasks can be improved or solved by adding information from the visual layer? Can the visual layer be used to search for music?

Rudolf Mayer

PhD student, Vienna University of Technology

Title: Music and Lyrics – Multi-modal Analysis of Music

Abstract: Multimedia data by definition comprises several different types of content modalities. Music specifically comprises, e.g., audio at its core, text in the form of lyrics, images by means of album covers, or video in the form of music videos. Yet, in many Music Information Retrieval applications, only the audio content is utilised. Recent studies have shown the usefulness of incorporating other modalities. Often, textual information in the form of song lyrics or artist biographies was employed. Lyrics of music may be orthogonal to its sound, and differ greatly from other texts regarding their (rhyme) structure. Lyrics can thus be analysed in many different ways, by standard bag-of-words approaches, or by approaches that also take style and rhymes into account. The exploitation of these properties has potential for typical music information retrieval tasks such as musical genre classification. Of particular use can be the combination of features extracted from lyrics with the audio content, or with further modalities. Lyrics can also be interesting from cross-language aspects: sometimes there are cover versions with lyrics in a different language. In these cover versions, the message of the song, and the mood and emotions created by the lyrics, might be different than in the original version. This offers further interesting research opportunities for multi-modal analysis, and calls for a stronger focus on lyrics analysis.

Jen-Tzung Chien

Professor, National Chiao Tung University

Title: Bayesian Learning for Singing-Voice Separation

Abstract: This talk presents a Bayesian nonnegative matrix factorization (NMF) approach to extract singing voice from background music accompaniment. Using this approach, the likelihood function based on NMF is represented by a Poisson distribution and the NMF parameters, consisting of basis and weight matrices, are characterized by exponential priors. A variational Bayesian expectation-maximization algorithm is developed to learn variational parameters and model parameters for monaural source separation. A clustering algorithm is performed to establish two groups of bases: one for the singing voice and the other for the background music. Model complexity is controlled by adaptively selecting the number of bases for different mixed signals according to the variational lower bound. Model regularization is tackled through uncertainty modeling via variational inference based on marginal likelihood. We will show experimental results on the MIR-1K database.

Yu-Ren Chien

Postdoc Researcher, Institute of Information Science, Academia Sinica

Title: Alignment of Lyrics With Accompanied Singing Audio Based on Acoustic-Phonetic Vowel Likelihood Modeling

Abstract: Here at the SLAM lab, I have been working on the task of aligning lyrics with accompanied singing recordings.  With a vowel-only representation of lyric syllables, my approach evaluates likelihood scores of vowel types with glottal pulse shapes and formant frequencies extracted from a small set of singing examples.  The proposed vowel likelihood model is used in conjunction with a prior model of frame-wise syllable sequence in determining an optimal evolution of syllabic position.  New objective performance measures are introduced in the evaluation to provide further insight into the quality of alignment.  Use of glottal pulse shapes and formant frequencies is shown by a controlled experiment to account for a 0.07 difference in average normalized alignment error.  Another controlled experiment demonstrates that, with a difference of 0.03, F0-invariant glottal pulse shape gives a lower average normalized alignment error than does F0-invariant spectrum envelope, the latter being assumed by MFCC-based timbre models.

Tak-Shing Thomas Chan

Postdoc Researcher, Research Center for IT Innovation, Academia Sinica

Title: Guided Source Separation for Machine Listening and Brain-Computer Music Interfacing

Abstract: As an introvert who could not communicate with people in a (British) pub, I began to investigate the cocktail party problem, also known as source separation, in 2002. Source separation entails the recovery of the original signals given only the mixed signals. In particular, musical signal separation and brain signal separation are difficult problems because there are more sources than sensors. For a better separation, additional guidance is needed in the form of source models and side information. In this talk, we will present our current and future work on guided source separation as applied to musical and brain signals, with many potential applications for music information retrieval. As I am also an amateur singer and pianist, this talk will mostly concentrate on our singing voice separation work.


Zhe-Cheng Fan

PhD student, National Taiwan University

Title: DNN for Singing Voice Separation and Audio Melody Extraction

Abstract: With the explosive growth of audio music everywhere over the Internet, it is becoming more important to be able to classify or retrieve audio music based on its key components, such as vocal pitch for common popular music. In this talk, I am going to describe an effective two-stage approach to singing pitch extraction, which involves singing voice separation and pitch tracking for monaural polyphonic audio music. The approach has been submitted to the singing voice separation and audio melody extraction tasks of the Music Information Retrieval Evaluation eXchange (MIREX) in 2015. The results of the competition show that the proposed approach is superior to other submitted algorithms, which demonstrates the feasibility of the method for further applications in music processing.


Markus Schedl

Associate Professor, JKU Linz

Title: Music Retrieval and Recommendation via Social Media Mining

Abstract: Social media represent an unprecedented source of information about every topic of our daily lives. Since music plays a vital role for almost everyone, information about music items and artists is found in abundance in user-generated data. In this talk, I will report on our recent research on exploiting social media to extract music-related information, aiming to improve music retrieval and recommendation. More precisely, I will elaborate on the following questions: Which factors are important to human perception of music? How to extract and annotate music listening events from social media, in particular microblogs? What can this kind of data tell us about the music taste of people around the world? How to make accessible music listening data from social media in an intuitive way? How to build music recommenders tailored to user characteristics?

Peter Knees

Assistant Professor, JKU Linz

Title: Only Personalized Retrieval can be Semantic Retrieval or: What Music Producers Want from Retrieval and Recommender Systems

Abstract: Sample retrieval remains a central problem in the creative process of making electronic music. In this talk, I am going to describe the findings from a series of interview sessions involving users working creatively with electronic music. In the context of the GiantSteps project, we conducted in-depth interviews with expert users on location at the Red Bull Music Academy. When asked about their wishes and expectations for future technological developments in interfaces, most participants mentioned very practical requirements of storing and retrieving files. It becomes apparent that for music interfaces for creative expression, traditional requirements and paradigms for music and audio retrieval differ from those of consumer-centered MIR tasks such as playlist generation and recommendation, and that new paradigms need to be considered. Despite all technical aspects being controllable by the experts themselves, searching for sounds to use in composition remains a largely semantic process. The desired systems need to exhibit a high degree of adaptability to the individual needs of creative users.


Eva Zangerle

Assistant Professor, University of Innsbruck

Title: The #nowplaying Dataset in the Context of Recommender Systems

Abstract: The recommendation of musical tracks to users has been tackled by research from various angles. Recently, incorporating contextual information in the process of eliciting recommendation candidates has proven to be useful. In this talk, we report on our analyses of the effectiveness of affective contextual information for ranking track recommendation candidates. In particular, we evaluated such an approach on a dataset gathered from so-called #nowplaying tweets and looked into how incorporating affective information extracted from hashtags within these tweets can contribute to a better ranking of music recommendation candidates. We model the given data as a graph and subsequently exploit latent features computed based on this graph. We find that exploiting affective information about the user’s mood can improve the performance of the ranking function substantially.

Andreu Vall

PhD student, JKU Linz

Title: Fusing Web and Audio Predictors to Localize the Origin of Music Pieces for Geospatial Retrieval

Abstract: Localizing the origin of a music piece around the world enables some interesting possibilities for geospatial music retrieval, for instance, location-aware music retrieval or recommendation for travelers, or exploring non-Western music, a task neglected for a long time in music information retrieval (MIR). While previous approaches to determining the origin of music focused solely on exploiting either the audio content or web resources, we propose a method that fuses features from both sources in a way that outperforms stand-alone approaches. To this end, we propose the use of block-level features inferred from the audio signal to model music content. We show that these features outperform timbral and chromatic features previously used for the task. On the other hand, we investigate a variety of strategies to construct web-based predictors from web pages related to music pieces. We assess different parameters for this kind of predictor (e.g., number of web pages considered) and define a confidence threshold for prediction. Fusing the proposed audio- and web-based methods by a weighted Borda rank aggregation technique, we show on a previously used dataset of music from 33 countries around the world that the median placing error can be substantially reduced using K-nearest neighbor regression.
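
For readers unfamiliar with weighted Borda rank aggregation, the following is a minimal illustrative sketch of the general idea and not the speaker's implementation. The function name `weighted_borda`, the weights, and the example country lists are all made up for illustration; the abstract does not specify how points or weights are assigned.

```python
from collections import defaultdict


def weighted_borda(rankings, weights):
    """Fuse several best-first rankings into a single ranking.

    rankings: list of lists, each ordered from best to worst candidate.
    weights:  one non-negative weight per ranking (hypothetical values).
    """
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            # Borda points: n for the top candidate, n - 1 for the next, ...
            scores[candidate] += weight * (n - position)
    # Highest fused score first; ties are broken alphabetically.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical predictor outputs for one music piece (illustration only).
audio_ranking = ["Brazil", "Portugal", "Cuba"]
web_ranking = ["Portugal", "Brazil", "Spain"]
fused = weighted_borda([audio_ranking, web_ranking], weights=[0.4, 0.6])
print(fused)
```

Candidates missing from one list simply receive no points from it, which is one of several possible conventions.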

Tzu-Chun Yeh

PhD student, National Tsing Hua University

Title: AutoRhythm: A Music Game With Automatic Hit-Timing Generation And Percussion Identification

Abstract: In this talk, we will introduce a music rhythm game called AutoRhythm, which can automatically generate the hit timing for a rhythm game from a given piece of music, and identify user-defined percussion of real objects in real time. More specifically, AutoRhythm can automatically generate the beat timing of a piece of music via server-based computation, such that users can use any song from their personal music collection in a rhythm game. Moreover, to make such games more realistic, AutoRhythm allows users to interact with the game via any object that can produce a percussion sound, such as a pen or a chopstick hitting against a table. AutoRhythm can identify the percussions in real time while the music is playing. This identification is based on the power spectrum of each frame of the filtered recording obtained from active noise cancellation, where the estimated noisy playback music is subtracted from the original recording.

Chun-Ta Chen

PhD student, National Tsing Hua University

Title: Polyphonic Audio-To-Score Alignment Using Onset Detection And Constant Q Transform

Abstract: We propose an innovative method that aligns a polyphonic audio recording of music to its corresponding symbolic score. In the first step, we perform onset detection and then apply the constant Q transform around each onset. A similarity matrix is computed by using a scoring function which evaluates the similarity between notes in the music score and onsets in the audio recording. Finally, we use dynamic programming to extract the best alignment path in the similarity matrix. We compared two onset detectors and two note matching methods. Our method is more efficient and has higher precision than the traditional chroma-based DTW method. Our algorithm achieved the best precision, which is 10% higher than that of the traditional algorithm when the tolerance window is 50 ms.
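
The dynamic programming step can be illustrated with a generic sketch. This is not the speaker's algorithm: it assumes a precomputed similarity matrix (rows are score notes, columns are detected onsets, with made-up values), allows only the three standard DTW-style moves, and simply maximizes the summed similarity along the path. The function name `best_alignment_path` is hypothetical.

```python
def best_alignment_path(similarity):
    """Return a monotonic path of (note, onset) index pairs from the
    top-left to the bottom-right cell with maximal summed similarity."""
    n, m = len(similarity), len(similarity[0])
    total = [[float("-inf")] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    total[0][0] = similarity[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # Allowed predecessors: diagonal, previous note, previous onset.
            candidates = []
            if i > 0 and j > 0:
                candidates.append((total[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:
                candidates.append((total[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((total[i][j - 1], (i, j - 1)))
            best_score, best_prev = max(candidates, key=lambda c: c[0])
            total[i][j] = best_score + similarity[i][j]
            back[i][j] = best_prev
    # Trace back from the final cell to recover the path.
    path = []
    cell = (n - 1, m - 1)
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    path.reverse()
    return path


# Made-up similarities: 3 score notes against 4 detected onsets.
similarity = [
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.2, 0.8, 0.1],
    [0.0, 0.0, 0.1, 0.9],
]
path = best_alignment_path(similarity)
print(path)
```

A practical system would typically add step penalties or normalization so that longer paths are not favored merely for accumulating more cells.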


Li Su

Postdoc Researcher, Research Center for IT Innovation, Academia Sinica

Title: Music Technology of the Next Generation: Automatic Music Transcription and Beyond

Abstract: To date, music remains a less explored area in modern digital multimedia applications. When it comes to music, people always have many desired, unrealized, but forgotten dreams: Is it possible for me to learn music efficiently and happily without expensive tuition fees? Are there tools to make my singing voice beautiful? Can I make my own album by myself? Or learn to write songs easily? Solutions to all these general desires are either unseen or utilized merely by a small group of professional musicians. User-centered, personalized, portable and ubiquitous applications of smart music processing, for music appreciation, music education, music gaming, music production, and even the preservation and revitalization of musical cultural heritage, are becoming the arena of cutting-edge music technologies, and will be the heart of the future digital music market. Automatic music transcription (AMT), as one of the most challenging problems in machine listening of music, will play a significant role in the next wave of music technology from different perspectives. With its strong research and development capacity in music information retrieval (MIR) and a large number of musicians and music producers, Taiwan has a niche for launching the next revolution of music technology, succeeding the decade-long trend of global online music streaming, by incorporating MIR technology, augmented reality (AR), the internet of things (IoT), new signal processing and machine learning techniques, and creative ideas. This emerging field offers a new opportunity for Taiwan, where people are seeking the future position of our information technology and cultural and creative industries, all of which are now facing great challenges.



FEB 22 (MONDAY) 2016



This is a one-day workshop on music information retrieval research.  The talks will be given in English.

The workshop is sponsored by the bilateral MOST-FWF joint seminar program (the national academic funding agencies of Taiwan and Austria), and the Research Center for IT Innovation, Academia Sinica.

This workshop will take place in Room 106, Institute of Information Science (資訊所), Academia Sinica.

Contact: Dr. Yi-Hsuan Yang