Curriculum Vitae

Takahashi Toru

  (高橋 徹)

Profile Information

Affiliation
Professor, Faculty of Design Technology, Department of Information Systems Engineering, Osaka Sangyo University
Degree
Doctor of Engineering (Nagoya Institute of Technology)

Researcher number
30419494
J-GLOBAL ID
201201026236304402
researchmap Member ID
7000000887

Papers

 115

Misc.

 71
  • Toru Takahashi, Koji Yamada
    Journal of Osaka Sangyo University, Natural Sciences, 128(128) 31-40, Mar, 2017
  • TAKAHASHI Toru, NOSE Kazuo, TSUKAMOTO Naoyuki, YOSHIKAWA Koji
    IEICE technical report. Welfare Information technology, 114(357) 57-62, Dec 11, 2014  
This paper describes the development and evaluation of a system that reports tram positions using the Global Positioning System, together with the design concept behind it. The key point of the concept is that the system is built from easily acquirable, general-purpose equipment, since we intend to promote the use of location notification systems in other modes of transportation such as buses, taxis, and trains. A prototype based on this concept was developed and evaluated on a Hankai tram in service. We tested two map-matching algorithms to reduce estimation error and optimized the length between anchor points, finding that a suitable location-measurement period is 1 or 2 seconds. Experimental results show an expected total delay of 3 seconds for displaying the location and a maximum location error of 100 m. This confirms that a location notification system can be built from easily acquirable, general-purpose equipment. (A sketch of this style of map matching follows this entry.)
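    A minimal sketch of the anchor-point map matching described above, assuming planar coordinates; the paper's two actual algorithms are not reproduced here, and every name in this Python snippet is hypothetical.

      # Snap a noisy GPS fix onto the polyline defined by the track's anchor points.
      from math import hypot

      def project_onto_segment(p, a, b):
          """Project point p onto segment ab; return (projected point, distance)."""
          (px, py), (ax, ay), (bx, by) = p, a, b
          dx, dy = bx - ax, by - ay
          seg_len2 = dx * dx + dy * dy
          if seg_len2 == 0.0:                       # degenerate segment
              return a, hypot(px - ax, py - ay)
          # Clamp the projection parameter so the result stays on the segment.
          t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
          q = (ax + t * dx, ay + t * dy)
          return q, hypot(px - q[0], py - q[1])

      def map_match(fix, anchors):
          """Return the point on the anchor polyline nearest to the raw fix."""
          candidates = (project_onto_segment(fix, anchors[i], anchors[i + 1])
                        for i in range(len(anchors) - 1))
          return min(candidates, key=lambda r: r[1])[0]

      # Example: a fix with 7 m of lateral noise snaps back onto the track.
      track = [(0.0, 0.0), (100.0, 0.0), (100.0, 80.0)]
      print(map_match((50.0, 7.0), track))          # -> (50.0, 0.0)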
  • Shinpei Aso, Takeshi Saitou, Masataka Goto, Katsutoshi Itoyama, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    IPSJ SIG Technical Report, Music and Computer (MUS), 2012(13) 1-8, Jan 27, 2012
In this paper we describe a system that discriminates between singing and speaking voices. Given a clean speech signal, it outputs the likelihood of each of the singing and speaking voices. Previous systems use temporal transitions of the spectral envelope (MFCC) and fundamental frequency (F0) as discrimination features. Our system adds the peak interval of spectral change as a phoneme-duration feature and weights these features according to the duration of the input speech signal. Experimental results with one-second speech signals show that our system achieves 90.2% accuracy, compared to 86.7% for previous systems. We also describe a real-time demonstration application of our system.
  • Kohei Nagira, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7191 388-396, 2012  Peer-reviewed
    We present a method of blind source separation (BSS) for speech signals using a complex extension of infinite sparse factor analysis (ISFA) in the frequency domain. Our method is robust against delayed signals that usually occur in real environments, such as reflections, short-time reverberations, and time lags of signals arriving at microphones. ISFA is a conventional non-parametric Bayesian method of BSS, which has only been applied to time domain signals because it can only deal with real signals. Our method uses complex normal distributions to estimate source signals and mixing matrix. Experimental results indicate that our method outperforms the conventional ISFA in the average signal-to-distortion ratio (SDR). © 2012 Springer-Verlag.
  • Yasuharu Hirasawa, Naoki Yasuraoka, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7191 446-453, 2012  Peer-reviewed
    This paper focuses on blind speech separation in under-determined conditions, that is, in the case when there are more sound sources than microphones. We introduce a sound source model based on the Gaussian mixture model (GMM) to represent a speech signal in the time-frequency domain, and derive rules for updating the model parameters using the auxiliary function method. Our GMM sound source model consists of two kinds of Gaussians: sharp ones representing harmonic parts and smooth ones representing nonharmonic parts. Experimental results reveal that our method outperforms the method based on non-negative matrix factorization (NMF) by 0.7dB in the signal-to-distortion ratio (SDR), and by 1.7dB in the signal-to-interference ratio (SIR). This means that our method effectively removes interference coming from other talkers. © 2012 Springer-Verlag.
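    A plausible form of the two-component source model described above, written in my own notation rather than the paper's: the magnitude spectrogram of a speech source at time t could be modeled as

      \[
        M(f, t) = \sum_{n=1}^{N} w_n(t)\, \mathcal{N}\!\left(f;\; n F_0(t),\; \sigma_h^2\right)
                + \sum_{m=1}^{M} v_m(t)\, \mathcal{N}\!\left(f;\; \mu_m,\; \sigma_s^2\right),
        \qquad \sigma_h \ll \sigma_s,
      \]

    where the sharp Gaussians (variance \sigma_h^2) sit on the harmonics n F_0(t) and the smooth, broad Gaussians (variance \sigma_s^2) cover the nonharmonic residual.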
  • Kazunori Komatani, Kyoko Matsuyama, Ryu Takeda, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    IPSJ Journal (CD-ROM), 52(12) 3374-3385, Dec 15, 2011
  • Tatsuhiko Itohara, Takuma Otsuka, Takeshi Mizumoto, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    Proceedings of the IPSJ National Convention, 2011(1) 235-237, Mar 2, 2011
    In ensemble performance, beat tracking is a fundamental technique for obtaining the timing of actions. In an ensemble with a guitar, beat tracking must be robust against fluctuations in performance tempo and against diverse rhythms including off-beats; that is, it must follow variations in both (1) tempo and (2) note length. Conventional methods could not achieve both. This work improves tracking of both kinds of variation through audio-visual integration. For problem (1) we apply STPM, a method based on auditory information. Problem (2) is solved by exploiting the periodicity of the guitar-playing motion to obtain hand positions, and by applying a particle filter to them together with the reliability function obtained from STPM.
  • Nobuhide Yamakawa, Toru Takahashi, Tetsuro Kitahara, Tetsuya Ogata, Hiroshi G. Okuno
    Proceedings of the IPSJ National Convention, 73rd(2) 2-113-2-114, Mar 2, 2011
  • Takeshi Mizumoto, Kazuhiro Nakadai, Takami Yoshida, Ryu Takeda, Takuma Otsuka, Toru Takahashi, Hiroshi G. Okuno
    2011 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 2130-2137, 2011  Peer-reviewed
This paper presents the design and implementation of selectable sound separation functions on the telepresence system "Texai" using the robot audition software "HARK." An operator of Texai can "walk" around a faraway office to attend a meeting or talk with people through video-conference instead of meeting in person. With a normal microphone, the operator has difficulty recognizing the auditory scene of the Texai; e.g., he/she cannot know the number and locations of sounds. To solve this problem, we design selectable sound separation functions with 8 microphones in two modes, overview and filter, and implement them using HARK's sound source localization and separation. The overview mode visualizes the direction-of-arrival of surrounding sounds, while the filter mode provides sounds that originate from the range of directions the operator specifies. These functions enable the operator to be aware of a sound even if it comes from behind the Texai, and to concentrate on a particular sound. The design and implementation were completed in five days thanks to the portability of HARK. Experimental evaluations with actual and simulated data show that the resulting system localizes sound sources with a tolerance of 5 degrees.
  • Nobuhide Yamakawa, Toru Takahashi, Tetsuro Kitahara, Tetsuya Ogata, Hiroshi G. Okuno
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6704(2) 1-10, 2011  Peer-reviewed
    Our goal is to achieve a robot audition system that is capable of recognizing multiple environmental sounds and making use of them in human-robot interaction. The main problems in environmental sound recognition in robot audition are: (1) recognition under a large amount of background noise including the noise from the robot itself, and (2) the necessity of robust feature extraction against spectrum distortion due to separation of multiple sound sources. This paper presents the environmental recognition of two sound sources fired simultaneously using matching pursuit (MP) with the Gabor wavelet, which extracts salient audio features from a signal. The two environmental sounds come from different directions, and they are localized by multiple signal classification and, using their geometric information, separated by geometric source separation with the aid of measured head-related transfer functions. The experimental results show the noise-robustness of MP although the performance depends on the properties of the sound sources. © 2011 Springer-Verlag.
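    For reference, the matching pursuit recursion underlying the feature extraction above is the standard one (the paper's variant may differ in details): starting from the residual r_0 = x with a dictionary D of Gabor atoms,

      \[
        g_k = \arg\max_{g \in D} \left|\langle r_k, g \rangle\right|,
        \qquad
        r_{k+1} = r_k - \langle r_k, g_k \rangle\, g_k,
      \]

    so each iteration picks the most salient time-frequency atom and the residual energy decreases monotonically.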
  • Yasuharu Hirasawa, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6703(1) 348-358, 2011  Peer-reviewed
    In real-world situations, people often hear more than two simultaneous sounds. For robots, when the number of sound sources exceeds that of sensors, the situation is called under-determined, and robots with two ears need to deal with this situation. Some studies on under-determined sound source separation use L1-norm minimization methods, but the performance of automatic speech recognition with separated speech signals is poor due to its spectral distortion. In this paper, a two-stage separation method to improve separation quality with low computational cost is presented. The first stage uses a L1-norm minimization method in order to extract the harmonic structures. The second stage exploits reliable harmonic structures to maintain acoustic features. Experiments that simulate three utterances recorded by two microphones in an anechoic chamber show that our method improves speech recognition correctness by about three points and is fast enough for real-time separation. © 2011 Springer-Verlag.
  • Yang Zhang, Shun Nishide, Toru Takahashi, Hiroshi G. Okuno, Tetsuya Ogata
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2011, PT I, 6791 167-175, 2011  Peer-reviewed
Our goal is to develop a system that is able to learn and classify environmental sounds for robots working in the real world, where two main restrictions pertain to learning. First, the system has to learn from only a small amount of data in a limited time because of hardware restrictions. Second, it has to adapt to unknown data, since it is virtually impossible to collect samples of all environmental sounds. We used a neuro-dynamical model to build a prediction and classification system that can self-organize sound classes into its parameters by learning samples. The proposed system searches the parameter space to perform classification. In the experiment, we evaluated the classification accuracy for known and unknown sound classes.
  • Yasuharu Hirasawa, Naoki Yasuraoka, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 1756-1759, 2011  Peer-reviewed
    This paper presents an efficient algorithm to solve Lp-norm minimization problem for under-determined speech separation; that is, for the case that there are more sound sources than microphones. We employ an auxiliary function method in order to derive update rules under the assumption that the amplitude of each sound source follows generalized Gaussian distribution. Experiments reveal that our method solves the L1-norm minimization problem ten times faster than a general solver, and also solves Lp-norm minimization problem efficiently, especially when the parameter p is small; when p is not more than 0.7, it runs in real-time without loss of separation quality.
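    One standard way to write the problem this abstract refers to, in my notation; the paper's exact formulation may differ. A zero-mean generalized Gaussian prior on each source amplitude,

      \[
        p(s_k) \propto \exp\!\left(-\left|s_k / \sigma\right|^{p}\right),
      \]

    turns maximum-likelihood separation under the mixing constraint into the Lp-norm minimization

      \[
        \min_{\mathbf{s}} \sum_{k} |s_k|^{p}
        \quad \text{subject to} \quad A \mathbf{s} = \mathbf{x},
      \]

    which reduces to the L1-norm problem at p = 1; an auxiliary function that majorizes |s|^p by a quadratic term is what yields closed-form updates of the kind mentioned above.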
  • Hiromitsu Awano, Shun Nishide, Hiroaki Arie, Jun Tani, Toru Takahashi, Hiroshi G. Okuno, Tetsuya Ogata
    NEURAL INFORMATION PROCESSING, PT III, 7064 323-+, 2011  Peer-reviewed
The objective of our study is to find out how a sparse structure affects the performance of a recurrent neural network (RNN). Only a few existing studies have dealt with the sparse structure of RNNs trained with methods like Back Propagation Through Time (BPTT). In this paper, we propose an RNN with sparse connections trained by BPTT, called the Multiple Time-scale RNN (MTRNN), and investigate how sparse connections affect generalization performance and noise robustness. In experiments using data composed of alphabetic sequences, the MTRNN showed the best generalization performance when the connection rate was 40%. We also measured the sparseness of neural activity and found that it corresponds to generalization performance. These results mean that sparse connections improve learning performance and that the sparseness of neural activity could be used as a metric of generalization performance.
  • Yang Zhang, Tetsuya Ogata, Shun Nishide, Toru Takahashi, Hiroshi G. Okuno
    in Proc. of Joint 5th Int. Conf. on Soft Computing and Intelligent Systems and 11th International Symposium on advanced Intelligent Systems (SCIS & ISIS 2010), 378-383, Dec, 2010  Peer-reviewed
This paper describes our method of classifying non-speech environmental sounds for robots working in the real world, where two main restrictions pertain to learning. First, robots have to learn from only a small number of sounds in a limited time and space because of practical constraints. Second, they have to detect unknown sounds to avoid false classification, since it is virtually impossible to collect samples of all environmental sounds. Most previous methods require a huge number of samples of all target sounds, including noises, for training stochastic models such as the Gaussian mixture model. In contrast, we use a neuro-dynamical model to build a prediction and classification system. The neuro-dynamical system can be trained with a small number of sounds and generalizes to others by inferring the sound generation dynamics. After training, a self-organized space is structured for the sound generation dynamics, and the proposed system classifies on the basis of this space. The prediction results for sounds are used to detect unknown sounds in our system. In this paper, we show the results of preliminary experiments on the proposed model's classification of known and unknown sound classes.
  • Takuma Otsuka, Kazuhiro Nakadai, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    Proceedings of IEEE/RSJ-2010 Workshop on Robots and Musical Expression,CD-ROM, Oct, 2010  Peer-reviewed
  • Takeshi Mizumoto, Angelica Lim, Takuma Otsuka, Kazuhiro Nakadai, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    Proceedings of IEEE/RSJ-2010 Workshop on Robots and Musical Expression,CD-ROM, 159-171, Oct, 2010  Peer-reviewed
  • Angelica Lim, Takeshi Mizumoto, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    Proceedings of IEEE/RSJ-2010 Workshop on Robots and Musical Expression,CD-ROM, Oct, 2010  Peer-reviewed
  • Shinpei Aso, Takeshi Saitou, Masataka Goto, Katsutoshi Itoyama, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    Proceedings of the 13th International Conference on Digital Audio Effects (DAFx-10), Sep, 2010  Peer-reviewed
    This paper describes a singing-to-speaking synthesis system called "SpeakBySinging" that can synthesize a speaking voice from an input singing voice and the song lyrics. The system controls three acoustic features that determine the difference between speaking and singing voices: the fundamental frequency (F0), phoneme duration, and power (volume). By changing these features of a singing voice, the system synthesizes a speaking voice while retaining the timbre of the singing voice. The system first analyzes the singing voice to extract the F0 contour, the duration of each phoneme of the lyrics, and the power. These features are then converted to target values that are obtained by feeding the lyrics into a traditional text-to-speech (TTS) system. The system finally generates a speaking voice that preserves the timbre of the singing voice but has speech-like features. Experimental results show that SpeakBySinging can convert singing voices into speaking voices whose timbre is almost the same as the original singing voices.
  • Okuno Hiroshi G, Nakadai Kazuhiro, Takahashi Toru
    Proceedings of the Society Conference of IEICE, 2010, SS-72-SS-73, Aug 31, 2010
  • Akira Maezawa, Katsutoshi Itoyama, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    Proceedings of the 11th International Conference on Music Information Retrieval (ISMIR 2010), Aug, 2010  Peer-reviewed
  • YASURAOKA NAOKI, ITOYAMA KATSUTOSHI, YOSHIOKA TAKUYA, TAKAHASHI TORU, KOMATANI KAZUNORI, OGATA TETSUYA, OKUNO HIROSHI G
    IPSJ SIG Technical Report, Music and Computer (MUS), 2010(20) 1-8, Jul 21, 2010
    This paper presents a music manipulation system that enables a user to replace an instrument performance phrase in a polyphonic audio mixture. Two technical problems must be solved to realize this system: (1) separating the melody part from the accompaniment, and (2) synthesizing a new instrument performance that has the timbre and expression of the original one. Our method first performs the separation using a statistical model integrating a harmonic-and-inharmonic Gaussian mixture with nonnegative matrix factorization. It then synthesizes a new instrument performance by adding the acoustic characteristics given by the Gaussian mixture parameters to a sound generated by a MIDI synthesizer. Two evaluations confirm the effectiveness of the proposed method.
  • MAEZAWA Akira, GOTO Masataka, KOMATANI Kazunori, OGATA Tetsuya, OKUNO Hiroshi G
    Proceedings of the IPSJ National Convention, 72 143-144, Mar 8, 2010
  • YASURAOKA Naoki, ITOYAMA Katsutoshi, TAKAHASHI Toru, KOMATANI Kazunori, OGATA Tetsuya, OKUNO Hiroshi G
    Proceedings of the IPSJ National Convention, 72 183-184, Mar 8, 2010
  • LIM Angelica, MIZUMOTO Takeshi, OTSUKA Takuma, TAKAHASHI Toru, KOMATANI Kazunori, OGATA Tetsuya, OKUNO Hiroshi G
    Proceedings of the IPSJ National Convention, 72 201-202, Mar 8, 2010
  • MIZUMOTO Takeshi, TAKAHASHI Toru, KOMATANI Kazunori, OGATA Tetsuya, OKUNO Hiroshi G
    Proceedings of the IPSJ National Convention, 72 203-204, Mar 8, 2010
  • HIRASAWA Yasuharu, TAKAHASHI Toru, KOMATANI Kazunori, OGATA Tetsuya, OKUNO Hiroshi G
    Proceedings of the IPSJ National Convention, 72 253-254, Mar 8, 2010
  • YAMAKAWA Nobuhide, KITAHARA Tetsuro, TAKAHASHI Toru, KOMATANI Kazunori, OGATA Tetsuya, OKUNO Hiroshi G
    Proceedings of the IPSJ National Convention, 72 257-258, Mar 8, 2010
  • AKIYAMA Soramichi, KOMATANI Kazunori, TAKAHASHI Toru, OGATA Tetsuya, OKUNO Hiroshi G
    Proceedings of the IPSJ National Convention, 72 291-292, Mar 8, 2010
  • ASO Shinpei, SAITOU Takeshi, GOTO Masataka, ITOYAMA Katsutoshi, TAKAHASHI Toru, KOMATANI Kazunori, OGATA Tetsuya, OKUNO Hiroshi G
    Proceedings of the IPSJ National Convention, 72 295-296, Mar 8, 2010
  • AWANO Hiromitsu, OGATA Tetsuya, TAKAHASHI Toru, KOMATANI Kazunori, OKUNO Hiroshi G
    Proceedings of the IPSJ National Convention, 72 395-396, Mar 8, 2010
  • HINOSHITA Wataru, ARIE Hiroaki, TANI Jun, OGATA Tetsuya, TAKAHASHI Toru, KOMATANI Kazunori, OKUNO Hiroshi G
    Proceedings of the IPSJ National Convention, 72 525-526, Mar 8, 2010
  • TAKEDA Ryu, NAKADAI Kazuhiro, TAKAHASHI Toru, KOMATANI Kazunori, OGATA Tetsuya, OKUNO Hiroshi G
    Proceedings of the IPSJ National Convention, 72 27-28, Mar 8, 2010
  • TAKAHASHI Toru, NAKADAI Kazuhiro, KOMATANI Kazunori, OGATA Tetsuya, OKUNO Hiroshi G
    Proceedings of the IPSJ National Convention, 72 29-30, Mar 8, 2010
  • MATSUYAMA Kyoko, KOMATANI Kazunori, TAKAHASHI Toru, OGATA Tetsuya, OKUNO Hiroshi G
    Proceedings of the IPSJ National Convention, 72 129-130, Mar 8, 2010
  • Nobuhide Yamakawa, Toru Takahashi, Tetsuro Kitahara, Tetsuya Ogata, Hiroshi G. Okuno
    Proceedings of the Annual Conference of the Robotics Society of Japan (CD-ROM), 28th, 1H2-4, 2010
  • Toru Takahashi, Kazuhiro Nakadai, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    2010 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 470-475, 2010  Peer-reviewed
This paper describes an improvement of sound source separation for a simultaneous automatic speech recognition (ASR) system on a humanoid robot. Recognition errors in the system are caused by separation errors and interference from other sources. To improve separability, the original geometric source separation (GSS) is extended: our GSS uses the robot's measured head-related transfer function (HRTF) to estimate the separation matrix. Because the original GSS uses a simulated HRTF calculated from the distance between microphone and sound source, there is a large mismatch between the simulated and measured transfer functions, and this mismatch severely degrades recognition performance. Faster convergence of the separation matrix reduces separation error, and our approach supplies an initial separation matrix based on the measured transfer function that is closer to the optimal separation matrix than one based on a simulated transfer function, so we expect our GSS to converge faster (see the sketch after this entry). Our GSS also handles an adaptive step-size parameter. These new features have been added to the open-source robot audition software "HARK", newly updated as version 1.0.0. HARK has been installed on an HRP-2 humanoid with an 8-element microphone array. The listening capability of HRP-2 is evaluated by recognizing a target speech signal separated from the simultaneous speech of three talkers. The word correct rate (WCR) of ASR improves by 5 points under normal acoustic environments and by 10 points under noisy environments. Experimental results show that HARK 1.0.0 improves robustness against noise.
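    An illustration of the initialization idea above, in my own notation rather than the paper's: if A_meas is the steering matrix assembled from measured HRTFs and A_sim its simulated counterpart, the separation matrix W can be initialized from the measured response and refined with an adaptive step size,

      \[
        W_0 = A_{\mathrm{meas}}^{+} \quad \text{(Moore-Penrose pseudo-inverse)},
        \qquad
        W_{t+1} = W_t - \mu_t \left.\frac{\partial J}{\partial W}\right|_{W_t},
      \]

    where J is the GSS separation cost. Since W_0 built from A_meas starts closer to the optimum than one built from A_sim, fewer updates should be needed, which is the claimed source of faster convergence.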
  • Ryu Takeda, Kazuhiro Nakadai, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    2010 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 4366-4371, 2010  Peer-reviewed
    This paper presents the upper-limit evaluation of robot audition based on ICA-BSS in multi-source, barge-in and highly reverberant conditions. The goal is that the robot can automatically distinguish a target speech from its own speech and other sound sources in a reverberant environment. We focus on the multi-channel semi-blind ICA (MCSB-ICA), which is one of the sound source separation methods with a microphone array, to achieve such an audition system because it can separate sound source signals including reverberations with few assumptions on environments. The evaluation of MCSB-ICA has been limited to robot's speech separation and reverberation separation. In this paper, we evaluate MCSB-ICA extensively by applying it to multi-source separation problems under common reverberant environments. Experimental results prove that MCSB-ICA outperforms conventional ICA by 30 points in automatic speech recognition performance.
  • Takuma Otsuka, Takeshi Mizumoto, Kazuhiro Nakadai, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    TRENDS IN APPLIED INTELLIGENT SYSTEMS, PT I, PROCEEDINGS, 6096 102-+, 2010  Peer-reviewed
Our goal is to achieve a musical ensemble among a robot and human musicians where the robot listens to the music with its own microphones. The main issues are (1) robust beat-tracking, since the robot hears its own generated sounds in addition to the accompaniment, and (2) robust synchronization of its performance with the accompanying music even if the humans' musical performance fluctuates. This paper presents a music-ensemble Thereminist robot implemented on the humanoid HRP-2 with the following three functions: (1) self-generated Theremin sound suppression by semi-blind Independent Component Analysis, (2) beat tracking robust against tempo fluctuation in the humans' performance, and (3) feedforward control of Theremin pitch. Experimental results with a human drummer show the capability of this robot to adapt to the temporal fluctuations in his performance.
  • Kyoko Matsuyama, Kazunori Komatani, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    TRENDS IN APPLIED INTELLIGENT SYSTEMS, PT II, PROCEEDINGS, 6097 585-594, 2010  Peer-reviewed
We describe a novel dialogue strategy enabling robust interaction under noisy environments where automatic speech recognition (ASR) results are not necessarily reliable. We have developed a method that exploits utterance timing together with ASR results to interpret user intention, that is, to identify the one item that a user wants to indicate from the system's listing. The timing of utterances containing referential expressions is approximated by a Gamma distribution, which is integrated with ASR results by expressing both as probabilities. In this paper, we improve the identification accuracy by extending the method. First, we enable interpretation of utterances including ordinal numbers, which appear several times in our data collected from users. Then we use proper acoustic models and parameters, improving the identification accuracy by 4.0% in total. We also show that Latent Semantic Mapping enables more expressions to be handled in our framework.
  • Akira Maezawa, Katsutoshi Itoyama, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    TRENDS IN APPLIED INTELLIGENT SYSTEMS, PT III, PROCEEDINGS, 6098 249-259, 2010  Peer-reviewed
This work presents an automated violin fingering estimation method that helps a student violinist acquire the "sound" of his/her favorite recording artist created by the artist's unique fingering. Our method realizes this by analyzing an audio recording played by the artist and recovering the most playable fingering that recreates the aural characteristics of the recording. Recovering the aural characteristics requires estimating the bowed string from an audio recording and using the estimated result for the optimal fingering decision. The former requires high accuracy and robustness against the use of different violins or brands of strings; the latter needs to create a natural fingering for the violinist. We solve the first problem by detecting estimation errors using rule-based algorithms and by adapting the estimator to the recording based on mean normalization. We solve the second problem by incorporating, in addition to the generic stringed-instrument model used in existing studies, a fingering model based on pedagogical practices of violin playing, defined on a sequence of two or three notes. The accuracy of the bowed string estimator improved by 21 points in a realistic situation (from 38% to 59%) by incorporating error correction and mean normalization. Subjective evaluation of the optimal fingering decision algorithm by seven violinists on 22 musical excerpts showed that our proposed model was preferred over the model used in existing studies (p = 0.01), but no significant preference for the proposed method defined on sequences of two notes versus three notes was observed (p = 0.05).
  • Takuma Otsuka, Kazuhiro Nakadai, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    PROCEEDINGS OF THE TWENTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-10), 1238-1244, 2010  Peer-reviewed
Our goal is to develop an interactive music robot, i.e., a robot that presents a musical expression together with humans. A music interaction requires two important functions: synchronization with the music, and musical expression such as singing and dancing. Many instrument-performing robots are only capable of the latter function; thus, they may have difficulty in playing live with human performers. The synchronization function is critical for the interaction. We classify synchronization and musical expression into two levels: (1) the rhythm level and (2) the melody level. Two issues in achieving two-level synchronization and musical expression are: (1) simultaneous estimation of the rhythm structure and the current part of the music and (2) derivation of the estimation confidence to switch behavior between the rhythm level and the melody level. This paper presents a score following algorithm, incremental audio-to-score alignment, that conforms to the two-level synchronization design using a particle filter (a toy sketch follows this entry). Our method estimates the score position for the melody level and the tempo for the rhythm level. The reliability of the score position estimation is extracted from the probability distribution of the score position. Experiments are carried out using polyphonic jazz songs. The results confirm that our method switches levels in accordance with the difficulty of the score estimation. When the tempo of the music is less than 120 beats per minute (bpm), the estimated score positions are accurate and reported; when the tempo is over 120 bpm, the system tends to report only the tempo to suppress errors in the reported score position predictions.
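    A toy sketch of the particle-filter scheme described above, with a stand-in observation model; none of the names or constants below come from the paper.

      import numpy as np

      rng = np.random.default_rng(0)
      N = 500                                   # number of particles
      pos = rng.uniform(0.0, 4.0, N)            # score position (beats)
      tempo = rng.uniform(60.0, 180.0, N)       # tempo (bpm)
      w = np.full(N, 1.0 / N)                   # particle weights

      def likelihood(onset_strength, positions):
          """Stand-in observation model: when a strong onset is heard,
          favor particles whose score position is near an integer beat."""
          dist = np.abs(positions - np.round(positions))
          return np.exp(-dist**2 / 0.01) * onset_strength + 1e-12

      dt = 0.1                                  # frame hop (seconds)
      for onset_strength in [0.1, 0.9, 0.2, 0.8]:   # toy onset observations
          # Predict: advance each particle by its tempo, with small drift.
          pos += tempo / 60.0 * dt + rng.normal(0.0, 0.01, N)
          tempo += rng.normal(0.0, 0.5, N)
          # Update and normalize weights.
          w *= likelihood(onset_strength, pos)
          w /= w.sum()
          # Resample when the weights degenerate (low effective sample size).
          ess = 1.0 / (N * np.sum(w**2))        # normalized ESS in (0, 1]
          if ess < 0.5:
              idx = rng.choice(N, N, p=w)
              pos, tempo, w = pos[idx], tempo[idx], np.full(N, 1.0 / N)

      # The spread of the final position posterior could serve as the
      # reliability measure that drives the rhythm/melody level switch.
      print(f"score position ~ {np.average(pos, weights=w):.2f} beats, "
            f"tempo ~ {np.average(tempo, weights=w):.0f} bpm")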
  • Hideki Kawahara, Masanori Morise, Toru Takahashi, Hideki Banno, Ryuichi Nisimura, Toshio Irino
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, 38-+, 2010  Peer-reviewed
A systematic framework for non-periodic excitation source representation is proposed for high-quality speech manipulation systems such as TANDEM-STRAIGHT, which is basically a channel VOCODER. The proposed method consists of two subsystems for non-periodic components: a colored noise source and an event analyzer/generator. The colored noise source is represented by a sigmoid model with non-linear level conversion. Two model parameters, the boundary frequency and the slope, are estimated based on pitch-range linear prediction combined with F0-adaptive temporal axis warping and on the original temporal axis. The event subsystem detects events based on the kurtosis of filtered speech signals. The proposed framework provides significant quality improvement for high-quality recorded speech materials.
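    One plausible reading of the sigmoid model named above; the exact parameterization in TANDEM-STRAIGHT may differ. The aperiodicity level at frequency f would be

      \[
        a(f) = \frac{1}{1 + \exp\!\left(-\beta\,(f - f_b)\right)},
      \]

    with the boundary frequency f_b and slope \beta as the two model parameters the abstract mentions, applied together with the non-linear level conversion.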
  • Kyoko Matsuyama, Kazunori Komatani, Ryu Takeda, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 3050-3053, 2010  Peer-reviewed
In our barge-in-able spoken dialogue system, the user's behaviors, such as barge-in timing and utterance expressions, vary according to his/her characteristics and situation. The system adapts to these behaviors by modeling them. We analyzed 1584 utterances collected by our systems for quiz and news-listing tasks and showed that the ratio of referential expressions used depends on the individual user and on the average length of the listed items. This tendency was incorporated as a prior probability into our method and improved the identification accuracy of the user's intended items.
  • Nobuhide Yamakawa, Tetsuro Kitahara, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2342-+, 2010  Peer-reviewed
Research on environmental sound recognition has not developed as much as that on speech and musical signals. One reason is that the category of environmental sounds covers a broad range of acoustic natures. We classified them in order to explore suitable recognition techniques for each characteristic. We focus on impulsive sounds and their non-stationary features within and between analysis frames. We used matching pursuit as a framework for applying wavelet analysis to extract the temporal variation of audio features inside a frame. We also investigated the validity of modeling the decaying patterns of sounds using hidden Markov models. Experimental results indicate that sounds with multiple impulsive signals are recognized better by using time-frequency analysis bases than by frequency-domain analysis. Classification of sound classes with a long, clear decaying pattern improves when HMMs with multiple hidden states are applied.
  • Hiromitsu Awano, Tetsuya Ogata, Shun Nishide, Toru Takahashi, Kazunori Komatani, Hiroshi G. Okuno
    IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS (SMC 2010), 2010  Peer-reviewed
The objective of our study was to develop dynamic collaboration between a human and a robot. Most conventional studies have created pre-designed rule-based collaboration systems to determine the timing and behavior of robots participating in tasks. Our aim is to introduce the robot's confidence in the task as a criterion for determining its timing and behavior. In this paper, we report the effectiveness of applying reproduction accuracy as a measure for quantitatively evaluating confidence in an object arrangement task. Our method comprises three phases. First, we obtain human-robot interaction data through the Wizard of Oz method. Second, the obtained data are trained with a neuro-dynamical system, namely the Multiple Time-scales Recurrent Neural Network (MTRNN). Finally, the prediction error of the MTRNN is applied as a confidence measure to determine the robot's behavior. The robot participated in the task when its confidence was high, and merely observed when its confidence was low. Training data were acquired using an actual robot platform, Hiro, and the method was evaluated in a robot simulator. The results revealed that motion trajectories could be precisely reproduced with a high degree of confidence, demonstrating the effectiveness of the method.
  • Ryu Takeda, Kazuhiro Nakadai, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    IEEE/RSJ 2010 INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS 2010), 1949-1956, 2010  Peer-reviewed
This paper describes a speedup and performance improvement of multi-channel semi-blind ICA (MCSB-ICA) with parallel and resampling-based block-wise processing. MCSB-ICA is an integrated method of sound source separation that accomplishes blind source separation, blind dereverberation, and echo cancellation. This method enables robots to separate a user's speech from observed signals that include the robot's own speech, other speech, and their reverberations, without a priori information. The main problem in applying MCSB-ICA to robot audition is its high computational cost. We tackle this with multithreaded programming; the two main issues are 1) the design of parallel processing and 2) incremental implementation. These are solved by a) a multiple-stack-based parallel implementation and b) resampling-based overlaps and block-wise separation. The experimental results proved that our method reduces the real-time factor to less than 0.5 with an eight-core CPU and improves automatic speech recognition performance by 2-10 points compared with the single-stack-based parallel implementation without the resampling technique.

Books and Other Publications

 8

Presentations

 79

Teaching Experience

 18

Professional Memberships

 6

Works

 1

Research Projects

 14

Research Themes

 1
  • Research theme (English)
    Human-robot interaction, speech communication, speech recognition, auditory scene understanding
    Keywords (English)
    Microphone array, acoustic features, speech recognition, sound source localization, sound source separation
    Overview (English)
    Working on the challenges of realizing natural dialogue between robots and humans in real-world environments