Tutorials

We are happy to announce six high-level tutorials in conjunction with Interspeech 2013.

In accordance with the special focus of the conference, the tutorial programme aims to balance topics related to computer science with topics related to the human sciences.

Interspeech 2013 also introduces a mix of tutorials on advanced topics (recent advances in ...) and tutorials providing the necessary basics of a particular subject (crash courses), a mix that we hope participants will find attractive.

We remind tutorial attendees that handouts will not be provided at the conference venue, either in printed or in electronic form (e.g., USB stick). Please download, and if you wish print, the material for your tutorial(s) before the conference. Last-minute downloads from the conference venue will also be possible, but we encourage you to download in advance to avoid slow network access at the venue.

Schedule

Morning tutorials 9h00 – 12h20
T1 Spectrogram reading in French and English
T2 Recent advances in incremental spoken language processing
T3 What speech researchers should know about video technology!

Afternoon tutorials 14h00 – 17h20
T4 Identification and modification of consonant perceptual cues
T5 Recent advances in large vocabulary speech recognition
T6 Forensic automatic speaker recognition


Detailed program


T1 – Spectrogram reading in French and English: language-dependent and independent acoustic cues to phonological features, coarticulation and influence of prosodic position

   Jacqueline Vaissière, jacqueline.vaissiere@univ-paris3.fr, CNRS, Université Sorbonne Nouvelle, France

Interspeech 2013 introduces the notion of tutorials targeting essential basic skills in speech processing. Spectrogram reading remains a key enabler for understanding and modeling speech, and it serves all research areas in speech communication. The course features an exceptional program, encompassing several languages (and language comparison), segmental and suprasegmental aspects, as well as articulatory-to-acoustic aspects. Jacqueline Vaissière, one of the few multilingual spectrogram readers, in the tradition of Victor Zue and Ken Stevens, will illustrate the tutorial with many practical examples and provide guidance for a practical session in the afternoon.

Summary. The goal of this tutorial is to give participants the key to identifying the temporal and spectral cues that allow for the decoding of speech materials in English and French. These two languages constitute a good introduction to the spectrographic diversity of the world’s languages, because the phonetics/phonology of English differs greatly from that of French. Sounds that are transcribed with the same IPA symbols in both languages, such as /i(:)/, /u(:)/, /l/, actually have different "canonical" acoustic F-patterns. (The notion of F-pattern is central to the tutorial, which will recapitulate Fant’s seminal insights: the Acoustic Theory of Speech Production.) Moreover, the direction and extent of coarticulation are determined by prosodic structure – syllable, morpheme, word boundaries, and stress(es) – which again differs widely between the two example languages. From the fine-grained parallel investigation of examples from real and synthesized speech, two sets of acoustic cues emerge, one language-independent and the other language-specific. Based on knowledge of the canonical place of constriction(s) for a given sound (phoneme), the method taught in this tutorial allows for the interpretation of the shape of formant transitions between a vowel and a consonant, and finally for the identification of sounds. This tutorial is strongly recommended for speech scientists who wish to tap the considerable potential of acoustical/articulatory modeling, building on tools such as Maeda’s articulatory model.

Note. Practice sessions in spectrogram reading, supervised by the tutorial presenter, will be organized during the afternoon, in parallel with the afternoon tutorials. Speech samples will be made available for download before the conference, along with practical details (e.g., software to install). Note that interested participants must bring their own computer.

→ Download handouts
 



T2 – Recent Advances in Incremental Spoken Language Processing

   Timo Baumann, baumann@informatik.uni-hamburg.de, Universität Hamburg, Germany
   David Schlangen, david.schlangen@uni-bielefeld.de, Universität Bielefeld, Germany

Though a crucial issue in many real-world applications, incremental spoken language processing is rarely discussed or presented in tutorials. This tutorial will provide a complete tour of spoken dialogue components, from ASR to synthesis, and of the issues one faces in making them incremental. The tutorial also presents a platform dedicated to the rapid development of dialogue systems, which may benefit the whole speech community. This tutorial is nicely complemented by T5 on large vocabulary speech recognition.

Summary. Incremental processing – that is, the processing of user input while it is still ongoing, and the preparation of possibly concurrent system reactions – is about to move from being a research challenge to being deployed and beneficial for users (e.g., in Google Voice Search). Beyond speeding up the presentation of voice search results, it offers the potential for creating spoken dialogue systems with much more natural behavior with respect to turn-taking and to the production and understanding of feedback utterances. In this tutorial, we will discuss the challenges posed by incremental spoken language processing and present the state of the art in incremental processing. In particular, we will describe and demonstrate a framework for incremental processing that we have implemented over the last few years. It offers an architecture for creating systems by connecting modules, and contains reference implementations of the full chain of modules: ASR, NLU, Dialogue Management, Action Selection, Natural Language Generation, Speech Synthesis. The tutorial is targeted at researchers interested in incremental processing in general, and in particular at researchers who are interested in "incrementalizing" and want to be able to quickly realize end-to-end systems within which to test their modules.

→ Download handouts



T3 – What speech researchers should know about video technology!

   Koichi Shinoda, shinoda@cs.titech.ac.jp, Tokyo Institute of Technology, Japan
   Florian Metze, fmetze@cs.cmu.edu, Carnegie Mellon University, USA

The synergy between multimedia video and speech technology has long existed and is now growing at a fast pace. However, the two domains are still poorly connected. We are very glad to schedule this tutorial, given by presenters working at the frontier between the multimedia and speech communities, which aims to bridge this gap by providing the basics of video processing in light of speech and audio knowledge. This tutorial is also strongly related to the Interspeech satellite workshop “Speech, Language and Audio in Multimedia” (SLAM 2013).

Summary. Thousands of videos are constantly being uploaded to the web, creating a vast resource and an ever-growing demand for methods to make them easier to index, search, and retrieve. While visual information is a very important part of a video, acoustic and speech information often complements it. State-of-the-art "content-based video retrieval" (CBVR) research faces several challenges: how to robustly and efficiently process large amounts of data, how to train classifiers and segmenters on unlabeled data, how to represent and then fuse information across modalities, how to include human feedback, etc. Thanks to advances in computing technology, many of the statistical approaches originally developed for speech processing can now be readily used for CBVR. This tutorial aims to present the state of the art in video processing to the speech community by discussing the most relevant tasks in NIST's TREC Video Retrieval Evaluation (TRECVID) and workshop series (http://trecvid.nist.gov/). We liken TRECVID's "Semantic Indexing" (SIN) task, in which a system must identify occurrences of concepts such as "desk" or "dancing" in a video, to the word spotting approach. We then explain more recent and challenging tasks, such as "Multimedia Event Detection" (MED) and "Multimedia Event Recounting" (MER), which can be compared to meeting transcription and summarization tasks in the speech area. Finally, we lay out how the speech and language community can contribute to this work, given its own vast body of experience, and identify opportunities for advancing speech-centric research on these datasets, whose large scale and multi-modal nature pose unique challenges and opportunities for future research.

→ More details

→ Download handouts



T4 – The identification and modification of consonant perceptual cues in natural speech

   Andrea Trevino Carolina, atrevin2@illinois.edu, University of Illinois at Urbana-Champaign, USA
   Jont Allen, jontalle@illinois.edu, University of Illinois at Urbana-Champaign, USA
   Feipeng Li, fpli75@gmail.com, Johns Hopkins University, USA

Our knowledge of speech perception relies on a growing body of experimental and theoretical evidence, but identifying the cues in the speech signal that support cognitive representations is still an open issue, in line with the focus of the conference. Signal modification provides a powerful tool to assess the relative importance and robustness of each cue across languages and noise conditions. This tutorial will enable researchers to manipulate the cues involved in consonant perception, pointing towards a wide range of applications in psychoacoustics, phonetics, phonology, automatic speech recognition, and hearing aid development.

Summary. The focus of this tutorial is the identification and manipulation of consonant cues in natural speech. Humans are able to recognize naturally-spoken consonants with remarkable accuracy despite the huge amount of signal variability. Understanding which cues humans use to decode these variable consonants is fundamental to any research involving natural speech. We will review the history of speech cue research, including the work at Haskins Labs using samples of synthetic speech, and the corresponding testing of early perceptual cue hypotheses using naturally-spoken speech samples. This will be followed by a discussion of the development of modern speech cue concepts and current approaches to analysis. The HSR research group at UIUC has developed a psychoacoustic approach to identifying and determining the roles of consonant cues in naturally-spoken speech. We will show how data from three basic psychoacoustic experiments can be used to isolate the necessary and sufficient consonant cue region in natural, variable speech. This tutorial will provide an understanding of which acoustic components humans use to decode natural speech, and how these components can be modified to affect perception.

→ More details

→ Download handouts



T5 – Recent Advances in Large Vocabulary Continuous Speech Recognition

   George Saon, gsaon@us.ibm.com, IBM T. J. Watson Research Center, USA
   Jen-Tzung Chien, jtchien@nctu.edu.tw, National Chiao Tung University, Taiwan

Large vocabulary speech recognition has undergone deep changes in the past few years with the introduction of new paradigms such as discriminative transforms and deep neural networks. While there have been many tutorials on deep neural networks recently, we feel there is a lack of a global overview of recent advances in the field. This tutorial will provide such an overview and pairs nicely with T2 on incremental speech processing.

Summary. In this tutorial, we will present some state-of-the-art techniques for large vocabulary continuous speech recognition (LVCSR). This tutorial will cover the different components of LVCSR, including front-end processing, acoustic modeling, language modeling, hypothesis search, and system combination. In the front-end processing module, we will introduce popular feature extraction methods and transformations and discuss speaker-adaptive and discriminative features. In the acoustic modeling section, we will present feature-space and model-space discriminative training and speaker adaptation. A series of discriminative estimation methods will also be compared. Additionally, we will report some recent progress on the use of deep neural network acoustic models. In language modeling (LM), we will address Bayesian learning to deal with the issues of insufficient training data, large-span modeling and model regularization. High-performance LMs will be presented. In hypothesis search, we will present state-of-the-art search algorithms and system combination methods. Additionally, we will point out some new trends for LVCSR, including structural state model, basis representation, model regularization and deep belief networks.

→ More details

→ Download handouts



T6 – Forensic Automatic Speaker Recognition: Theory, Implementation and Practice

   Andrzej Drygajlo, andrzej.drygajlo@epfl.ch, Federal Institute of Technology Lausanne, Switzerland

The use of speech technology for forensic speaker recognition is becoming more and more frequent in police investigations, legal proceedings and court cases. This evolution is a consequence of technological progress in the field of speech science, but it raises ethical issues (which are regularly addressed in the scientific community) and creates societal confusion: voice is often erroneously presented (in works of fiction, in press articles, etc.) as a means of personal identification as powerful as fingerprints or DNA, which leads to misinterpretations of the potential and limits of the current technology. In this context, we are happy to schedule a tutorial on forensic speaker recognition, which will present an overview of the most recent concepts, technologies, methodologies and common practice in the field, given by one of the leading specialists in the domain. Note that T1 serves as a nice introduction to this tutorial.

Summary. Forensic speaker recognition (FSR) is the process of determining if a specific individual (suspected speaker) is the source of a questioned voice recording (trace). Results of FSR-based investigations may be of pivotal importance at any stage of the course of justice, be it the very first police investigation or a court trial. Forensic speaker recognition has proven an effective tool in the fight against crime, yet there is a constant need for more research because of the difficulties posed by within-speaker (within-source) variability, between-speakers (between-sources) variability, and differences in recording session conditions of digital communication networks. An additional difficulty lies in adapting automatic methods of speaker recognition to the forensic methodology that provides a coherent way of assessing and presenting recorded speech as scientific evidence. This tutorial aims at introducing deterministic and statistical forensic automatic speaker recognition (FASR) methods that provide a coherent way of quantifying and presenting recorded voice as biometric evidence, as well as the assessment of its strength (likelihood ratio) in the Bayesian interpretation framework, including scoring and direct methods, compatible with interpretations in other forensic disciplines (e.g., DNA). It also aims at presenting data-driven tools and methodology for assisting forensic scientists in their evaluative and investigative capacities when voice recordings are acquired. The tutorial will show universal guidelines for the calculation of the biometric evidence and its strength, taking into account the models of the within-source variability of the suspected speaker and the between-sources variability of the relevant population given the questioned recording. It will also introduce explicit inter-session variability compensation techniques to combat the degradations caused by adverse conditions and speech variability.

→ More details

→ Download handouts