Research Achievements

高橋 徹

タカハシ トオル  (Takahashi Toru)

Basic Information

Affiliation
Professor, Department of Information Systems Engineering, Faculty of Design Technology, Osaka Sangyo University
Degree
Doctor of Engineering (Nagoya Institute of Technology)

Researcher Number
30419494
J-GLOBAL ID
201201026236304402
researchmap Member ID
7000000887

External Links

Papers

 132

MISC

 109
  • Yasuharu Hirasawa, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6703(1) 348-358 2011  Refereed
    In real-world situations, people often hear more than two simultaneous sounds. For robots, when the number of sound sources exceeds that of sensors, the situation is called under-determined, and robots with two ears need to deal with this situation. Some studies on under-determined sound source separation use L1-norm minimization methods, but the performance of automatic speech recognition with separated speech signals is poor due to its spectral distortion. In this paper, a two-stage separation method to improve separation quality with low computational cost is presented. The first stage uses a L1-norm minimization method in order to extract the harmonic structures. The second stage exploits reliable harmonic structures to maintain acoustic features. Experiments that simulate three utterances recorded by two microphones in an anechoic chamber show that our method improves speech recognition correctness by about three points and is fast enough for real-time separation. © 2011 Springer-Verlag.
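The first stage described above solves an L1-norm minimization per time-frequency bin. As a rough illustration only (not the paper's implementation, which operates on complex STFT coefficients), the toy sketch below solves the same problem for a real-valued 2-microphone, 3-source mixture via linear programming; the mixing matrix and source values are hypothetical.

```python
# Toy sketch of per-bin L1-norm source estimation for an under-determined
# mixture (3 sources, 2 mics): solve  min ||s||_1  s.t.  A s = x.
import numpy as np
from scipy.optimize import linprog

def l1_min_sources(A, x):
    """A: (n_mics, n_srcs) mixing matrix, x: (n_mics,) observation."""
    n = A.shape[1]
    # Split s = s_plus - s_minus with both parts >= 0; minimize their sum.
    c = np.ones(2 * n)
    A_eq = np.hstack([A, -A])
    res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * n))
    return res.x[:n] - res.x[n:]

A = np.array([[1.0, 0.7, 0.3],
              [0.2, 0.8, 1.0]])      # 2 mics, 3 sources
s_true = np.array([0.0, 1.2, -0.5])  # a sparse-ish source vector
x = A @ s_true
print(l1_min_sources(A, x))          # close to s_true when sources are sparse
```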
  • Yang Zhang, Shun Nishide, Toru Takahashi, Hiroshi G. Okuno, Tetsuya Ogata
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2011, PT I 6791 167-175 2011  Refereed
    Our goal is to develop a system that is able to learn and classify environmental sounds for robots working in the real world. In the real world, two main restrictions apply to learning. First, the system has to learn from only a small amount of data in a limited time because of hardware restrictions. Second, it has to adapt to unknown data, since it is virtually impossible to collect samples of all environmental sounds. We used a neuro-dynamical model to build a prediction and classification system that can self-organize sound classes into its parameters by learning samples. The proposed system searches the parameter space for classification. In the experiment, we evaluated the classification accuracy for known and unknown sound classes.
  • Yasuharu Hirasawa, Naoki Yasuraoka, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5 1756-1759 2011  Refereed
    This paper presents an efficient algorithm to solve Lp-norm minimization problem for under-determined speech separation; that is, for the case that there are more sound sources than microphones. We employ an auxiliary function method in order to derive update rules under the assumption that the amplitude of each sound source follows generalized Gaussian distribution. Experiments reveal that our method solves the L1-norm minimization problem ten times faster than a general solver, and also solves Lp-norm minimization problem efficiently, especially when the parameter p is small; when p is not more than 0.7, it runs in real-time without loss of separation quality.
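The auxiliary-function update in this paper is derived for complex spectra under a generalized Gaussian source prior; the sketch below shows the same idea in its simplest real-valued form, an iteratively reweighted least-squares update for min ||s||_p^p subject to As = x. It is an illustration under those simplifying assumptions, not the authors' algorithm.

```python
# Minimal auxiliary-function / iteratively-reweighted-least-squares sketch
# for Lp-norm (p < 2) minimization under a linear mixing constraint.
import numpy as np

def lp_min_sources(A, x, p=0.7, n_iter=50, eps=1e-8):
    """Approximately solve  min ||s||_p^p  subject to  A s = x."""
    s = np.linalg.pinv(A) @ x            # least-norm initial guess
    for _ in range(n_iter):
        # Quadratic upper bound of |s_i|^p gives weights w_i = |s_i|^(p-2).
        w = (np.abs(s) + eps) ** (p - 2)
        Winv = np.diag(1.0 / w)
        # Closed-form weighted least-squares solution on the constraint set.
        s = Winv @ A.T @ np.linalg.solve(A @ Winv @ A.T, x)
    return s

A = np.array([[1.0, 0.7, 0.3],
              [0.2, 0.8, 1.0]])
x = A @ np.array([0.0, 1.2, -0.5])
print(lp_min_sources(A, x, p=0.7))   # converges toward a sparse solution
```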
  • Hiromitsu Awano, Shun Nishide, Hiroaki Arie, Jun Tani, Toru Takahashi, Hiroshi G. Okuno, Tetsuya Ogata
    NEURAL INFORMATION PROCESSING, PT III 7064 323-+ 2011  Refereed
    The objective of our study is to find out how a sparse structure affects the performance of a recurrent neural network (RNN). Only a few existing studies have dealt with sparse structure in RNNs trained with algorithms such as Back Propagation Through Time (BPTT). In this paper, we propose an RNN with sparse connections trained by BPTT, the Multiple Timescale RNN (MTRNN), and investigate how sparse connections affect generalization performance and noise robustness. In experiments using data composed of alphabetic sequences, the MTRNN showed the best generalization performance when the connection rate was 40%. We also measured the sparseness of neural activity and found that it corresponds to generalization performance. These results mean that sparse connections improve learning performance and that the sparseness of neural activity could be used as a metric of generalization performance.
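The sparse-connection idea above can be illustrated with a fixed binary mask over the recurrent weight matrix, applied both to the weights and to their gradients so that pruned connections stay pruned. This is a sketch of the concept only, not the authors' MTRNN code; the sizes and initialization are arbitrary.

```python
# Sketch: keep only a given fraction of recurrent connections (the paper
# reports the best generalization at a 40% connection rate).
import numpy as np

rng = np.random.default_rng(0)

def sparse_mask(n_units, connection_rate):
    """Binary mask keeping roughly `connection_rate` of all connections."""
    return (rng.random((n_units, n_units)) < connection_rate).astype(float)

n = 50
W = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))
mask = sparse_mask(n, 0.4)
W_sparse = W * mask
# During BPTT the same mask would be applied to the weight gradient,
# e.g.  W_sparse -= lr * (dL_dW * mask),  so masked weights remain zero.
```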
  • Yang Zhang, Tetsuya Ogata, Shun Nishide, Toru Takahashi, Hiroshi G. Okuno
    Proc. of Joint 5th Int. Conf. on Soft Computing and Intelligent Systems and 11th International Symposium on Advanced Intelligent Systems (SCIS & ISIS 2010) 378-383 Dec 2010  Refereed
  • 水本 武志, 辻野 広司, 高橋 徹, 駒谷 和範, 尾形 哲也, 奥乃 博
    Journal of Information Processing Society of Japan (IPSJ Journal) 51(10) 2007-2019 Oct 15, 2010  
    We present a model of the theremin's pitch and volume characteristics and a motion generation method for a theremin-playing robot, toward an ensemble between humans and robots. A theremin is an electronic instrument played by moving the player's hands: pitch and volume are controlled continuously without any physical contact with the instrument, so the approach is highly portable and applicable to robots with different hardware configurations. The main problems in theremin playing are that (1) there is no physical reference point for motion generation, so learning to play requires many training samples, and (2) the playing characteristics change with the electrostatic environment, so adaptive motion generation is required. To solve them, we built pitch and volume characteristic models that express the environmental influence as parameters, and developed a model-based feedforward arm control method that can play an arbitrary note within the instrument's range from a small number of measurements. Experiments confirmed that pitch can be controlled arbitrarily from about 12 measured points and that the desired pitch and volume can be played even when the environment changes; this was confirmed under four environments and on three different robots.
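As a loose illustration of the calibrate-then-invert idea above: fit a parametric pitch-versus-hand-position curve from a handful of measured points, then invert it to find the arm position for a target pitch. The model form below is an assumption for illustration, not the paper's characteristic model, and all numbers are hypothetical.

```python
# Illustrative sketch: fit a pitch model from ~12 calibration points,
# then invert it to command an arm position for a desired pitch.
import numpy as np
from scipy.optimize import curve_fit

def pitch_model(x, a, b, c):
    """Assumed model: pitch (Hz) rises as the hand nears the antenna."""
    return a / (x + b) + c

x_meas = np.linspace(0.05, 0.60, 12)              # hand positions (m)
f_meas = pitch_model(x_meas, 60.0, 0.08, 150.0)   # hypothetical measurements

params, _ = curve_fit(pitch_model, x_meas, f_meas, p0=(50.0, 0.1, 100.0))
a, b, c = params

def position_for_pitch(f_target):
    """Invert the fitted model: where to place the hand for a target pitch."""
    return a / (f_target - c) - b

print(position_for_pitch(440.0))   # arm position in metres for A4
```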
  • Takuma Otsuka, Kazuhiro Nakadai, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    Proceedings of IEEE/RSJ-2010 Workshop on Robots and Musical Expression, CD-ROM, Oct 2010  Refereed
  • Takeshi Mizumoto, Angelica Lim, Takuma Otsuka, Kazuhiro Nakadai, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    Proceedings of IEEE/RSJ-2010 Workshop on Robots and Musical Expression, CD-ROM, 159-171, Oct 2010  Refereed
  • Angelica Lim, Takeshi Mizumoto, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    Proceedings of IEEE/RSJ-2010 Workshop on Robots and Musical Expression, CD-ROM, Oct 2010  Refereed
  • Shinpei Aso, Takuya Saitou, Masataka Goto, Katsutoshi Itoyama, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    Proceedings of the 13th International Conference on Digital Audio Effects (DAFx-10) Sep 2010  Refereed
  • 奥乃 博, 中臺 一博, 高橋 徹
    Proceedings of the IEICE Society Conference 2010 SS-72-SS-73 Aug 31, 2010  
  • Akira Maezawa, Katsutoshi Itoyama, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    Proceedings of the 11th International Conference on Music Information Retrieval (ISMIR 2010) Aug 2010  Refereed
  • 安良岡 直希, 糸山克寿, 吉岡 拓也, 高橋 徹, 駒谷 和範, 尾形 哲也, 奥乃 博
    IPSJ SIG Technical Reports, Music and Computer (MUS) 2010(20) 1-8 Jul 21, 2010  
    This paper presents a music manipulation system that enables a user to replace an instrument performance phrase in a polyphonic audio mixture. Two technical problems must be solved to realize this system: (1) separating the target part from the accompaniment, and (2) synthesizing a new instrument performance that has the timbre and expression of the original one. Our method first performs the separation using a statistical model that integrates a harmonic-plus-inharmonic Gaussian mixture model (GMM) with nonnegative matrix factorization. It then synthesizes a new performance by transferring the acoustic characteristics given by the GMM parameters, such as fundamental frequency and harmonic intensities, onto a MIDI-synthesizer-generated sound for the new score. Two objective evaluations, (1) whether the original performance is correctly removed and (2) whether the new performance retains the characteristics of the original, confirm the effectiveness of the proposed method.
  • 前澤 陽, 後藤 真孝, 高橋 徹, 駒谷 和範, 尾形 哲也, 奥乃 博
    Proceedings of the 72nd IPSJ National Convention 143-144 Mar 8, 2010  
  • 安良岡 直希, 糸山 克寿, 高橋 徹, 駒谷 和範, 尾形 哲也, 奥乃 博
    Proceedings of the 72nd IPSJ National Convention 183-184 Mar 8, 2010  
  • 水本 武志, 大塚 琢馬, 高橋 徹, 駒谷 和範, 尾形 哲也, 奥乃 博
    Proceedings of the 72nd IPSJ National Convention 201-202 Mar 8, 2010  
  • 水本 武志, 高橋 徹, 駒谷 和範, 尾形 哲也, 奥乃 博
    Proceedings of the 72nd IPSJ National Convention 203-204 Mar 8, 2010  
  • 平澤 恭治, 高橋 徹, 駒谷 和範, 尾形 哲也, 奥乃 博
    Proceedings of the 72nd IPSJ National Convention 253-254 Mar 8, 2010  
  • 山川 暢英, 北原 鉄朗, 高橋 徹, 駒谷 和範, 尾形 哲也, 奥乃 博
    Proceedings of the 72nd IPSJ National Convention 257-258 Mar 8, 2010  
  • 穐山 空道, 駒谷 和範, 高橋 徹, 尾形 哲也, 奥乃 博
    Proceedings of the 72nd IPSJ National Convention 291-292 Mar 8, 2010  
  • 阿曽 慎平, 齋藤 毅, 後藤 真孝, 糸山 克寿, 高橋 徹, 駒谷 和範, 尾形 哲也, 奥乃 博
    Proceedings of the 72nd IPSJ National Convention 295-296 Mar 8, 2010  
  • 粟野 皓光, 尾形 哲也, 高橋 徹, 駒谷 和範, 奥乃 博
    Proceedings of the 72nd IPSJ National Convention 395-396 Mar 8, 2010  
  • 日下 航, 有江 浩明, 谷 淳, 尾形 哲也, 高橋 徹, 駒谷 和範, 奥乃 博
    Proceedings of the 72nd IPSJ National Convention 525-526 Mar 8, 2010  
  • 武田 龍, 中臺 一博, 高橋 徹, 駒谷 和範, 尾形 哲也, 奥乃 博
    Proceedings of the 72nd IPSJ National Convention 27-28 Mar 8, 2010  
  • 高橋 徹, 中臺 一博, 駒谷 和範, 尾形 哲也, 奥乃 博
    Proceedings of the 72nd IPSJ National Convention 29-30 Mar 8, 2010  
  • 松山 匡子, 駒谷 和範, 高橋 徹, 尾形 哲也, 奥乃 博
    Proceedings of the 72nd IPSJ National Convention 129-130 Mar 8, 2010  
  • 山川暢英, 高橋徹, 北原鉄朗, 尾形哲也, 奥乃博
    Proceedings of the 28th Annual Conference of the Robotics Society of Japan (CD-ROM), paper no. 1H2-4, 2010  
  • Toru Takahashi, Kazuhiro Nakadai, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    2010 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA) 470-475 2010  Refereed
    This paper describes an improvement in sound source separation for a simultaneous automatic speech recognition (ASR) system on a humanoid robot. Recognition errors in the system are caused by separation errors and interference from other sources. To improve separability, we extend the original geometric source separation (GSS): our GSS uses the robot's measured head-related transfer function (HRTF) to estimate the separation matrix. The original GSS uses a simulated HRTF calculated from the distance between microphone and sound source, and the large mismatch between simulated and measured transfer functions severely degrades recognition performance. Faster convergence of the separation matrix reduces separation error; our approach provides an initial separation matrix, based on the measured transfer function, that is closer to the optimal separation matrix than one based on a simulated function, so we expect faster convergence. Our GSS also handles an adaptive step-size parameter. These new features are included in the open-source robot audition software "HARK", newly updated as version 1.0.0. HARK has been installed on an HRP-2 humanoid with an 8-element microphone array. The listening capability of HRP-2 is evaluated by recognizing a target speech signal separated from simultaneous speech by three talkers. The word correct rate (WCR) of ASR improves by 5 points under normal acoustic environments and by 10 points under noisy environments. Experimental results show that HARK 1.0.0 improves robustness against noise.
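The initialization idea above (start adaptation near the optimum by using measured transfer functions) can be sketched per frequency bin as a pseudo-inverse of the measured steering matrix. This is one plausible reading of the abstract, not HARK's actual GSS implementation, and the matrices here are random stand-ins.

```python
# Sketch: initialize an adaptive separation matrix W0 from a *measured*
# transfer-function (steering) matrix so that W0 @ H is already near I,
# rather than starting from a simulated free-field matrix.
import numpy as np

n_mics, n_srcs = 8, 3
rng = np.random.default_rng(1)

# Measured transfer functions for one frequency bin: (n_mics, n_srcs).
H_measured = (rng.normal(size=(n_mics, n_srcs))
              + 1j * rng.normal(size=(n_mics, n_srcs)))

W0 = np.linalg.pinv(H_measured)          # initial separation matrix
print(np.round(np.abs(W0 @ H_measured), 2))   # ~ identity: good start point
```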
  • Ryu Takeda, Kazuhiro Nakadai, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    2010 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA) 4366-4371 2010  Refereed
    This paper presents the upper-limit evaluation of robot audition based on ICA-BSS in multi-source, barge-in, and highly reverberant conditions. The goal is that the robot can automatically distinguish a target speech from its own speech and other sound sources in a reverberant environment. We focus on multi-channel semi-blind ICA (MCSB-ICA), one of the sound source separation methods with a microphone array, to achieve such an audition system, because it can separate sound source signals including reverberations with few assumptions on environments. The evaluation of MCSB-ICA has so far been limited to the robot's own speech separation and reverberation separation. In this paper, we evaluate MCSB-ICA extensively by applying it to multi-source separation problems under common reverberant environments. Experimental results prove that MCSB-ICA outperforms conventional ICA by 30 points in automatic speech recognition performance.
  • Takuma Otsuka, Takeshi Mizumoto, Kazuhiro Nakadai, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    TRENDS IN APPLIED INTELLIGENT SYSTEMS, PT I, PROCEEDINGS 6096 102-+ 2010  Refereed
    Our goal is to achieve a musical ensemble among a robot and human musicians where the robot listens to the music with its own microphones. The main issues are (1) robust beat-tracking, since the robot hears its own generated sounds in addition to the accompanied music, and (2) robustly synchronizing its performance with the accompanied music even if the humans' musical performance fluctuates. This paper presents a music-ensemble Theremin-playing robot implemented on the humanoid HRP-2 with the following three functions: (1) self-generated Theremin sound suppression by semi-blind Independent Component Analysis, (2) beat tracking robust against tempo fluctuation in humans' performance, and (3) feedforward control of Theremin pitch. Experimental results with a human drummer show the capability of this robot for adapting to the temporal fluctuation in his performance.
  • Kyoko Matsuyama, Kazunori Komatani, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    TRENDS IN APPLIED INTELLIGENT SYSTEMS, PT II, PROCEEDINGS 6097 585-594 2010  Refereed
    We describe a novel dialogue strategy enabling robust interaction under noisy environments where automatic speech recognition (ASR) results are not necessarily reliable. We have developed a method that exploits utterance timing together with ASR results to interpret user intention, that is, to identify the one item that a user wants to indicate from the system. The timing of utterances containing referential expressions is approximated by a Gamma distribution, which is integrated with ASR results by expressing both of them as probabilities. In this paper, we improve the identification accuracy by extending the method. First, we enable interpretation of utterances including ordinal numbers, which appear several times in our data collected from users. Then we use proper acoustic models and parameters, improving the identification accuracy by 4.0% in total. We also show that Latent Semantic Mapping enables more expressions to be handled in our framework.
  • Akira Maezawa, Katsutoshi Itoyama, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    TRENDS IN APPLIED INTELLIGENT SYSTEMS, PT III, PROCEEDINGS 6098 249-259 2010  Refereed
    This work presents an automated violin fingering estimation method that helps a student violinist acquire the "sound" of his/her favorite recording artist created by the artist's unique fingering. Our method realizes this by analyzing an audio recording played by the artist and recovering the most playable fingering that recreates the aural characteristics of the recording. Recovering the aural characteristics requires estimating the bowed string from the audio recording, and using the estimated result for optimal fingering decision. The former requires high accuracy and robustness against the use of different violins or brands of strings; the latter needs to create a natural fingering for the violinist. We solve the first problem by detecting estimation errors using rule-based algorithms, and by adapting the estimator to the recording based on mean normalization. We solve the second problem by incorporating, in addition to the generic stringed-instrument model used in existing studies, a fingering model based on pedagogical practices of violin playing, defined on a sequence of two or three notes. The accuracy of the bowed string estimator improved by 21 points in a realistic situation (38% to 59%) by incorporating error correction and mean normalization. Subjective evaluation of the optimal fingering decision algorithm by seven violinists on 22 musical excerpts showed that our proposed model was preferred over the model used in existing studies (p = 0.01), but no significant preference for the proposed method defined on sequences of two notes versus three notes was observed (p = 0.05).
  • Takuma Otsuka, Kazuhiro Nakadai, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    PROCEEDINGS OF THE TWENTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-10) 1238-1244 2010  Refereed
    Our goal is to develop an interactive music robot, i.e., a robot that presents a musical expression together with humans. A music interaction requires two important functions: synchronization with the music and musical expression, such as singing and dancing. Many instrument-performing robots are only capable of the latter function; thus, they may have difficulty in playing live with human performers. The synchronization function is critical for the interaction. We classify synchronization and musical expression into two levels: (1) the rhythm level and (2) the melody level. Two issues in achieving two-layer synchronization and musical expression are: (1) simultaneous estimation of the rhythm structure and the current part of the music and (2) derivation of the estimation confidence to switch behavior between the rhythm level and the melody level. This paper presents a score following algorithm, incremental audio to score alignment, that conforms to the two-level synchronization design using a particle filter. Our method estimates the score position for the melody level and the tempo for the rhythm level. The reliability of the score position estimation is extracted from the probability distribution of the score position. Experiments are carried out using polyphonic jazz songs. The results confirm that our method switches levels in accordance with the difficulty of the score estimation. When the tempo of the music is less than 120 beats per minute (bpm), the estimated score positions are accurate and reported; when the tempo is over 120 bpm, the system tends to report only the tempo to suppress errors in the reported score position predictions.
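The two-level design above can be illustrated with a toy particle filter that tracks score position and tempo jointly and switches levels based on the spread of the position posterior. The observation model and the confidence rule below are stand-ins for illustration, not the paper's audio-matching model.

```python
# Toy particle filter: track score position (beats) and tempo (bpm);
# report position (melody level) only when the posterior is concentrated,
# otherwise report tempo only (rhythm level).
import numpy as np

rng = np.random.default_rng(0)
N = 500
pos = rng.uniform(0.0, 1.0, N)           # score position in beats
tempo = rng.uniform(100.0, 140.0, N)     # beats per minute
w = np.ones(N) / N

def observe_likelihood(pos, observed_pos, sigma=0.5):
    # Stand-in for matching audio features against the score.
    return np.exp(-0.5 * ((pos - observed_pos) / sigma) ** 2)

dt = 0.5                                  # seconds between observations
for observed_pos in [1.0, 2.1, 2.9, 4.2]:
    pos += tempo / 60.0 * dt + rng.normal(0, 0.05, N)   # predict
    tempo += rng.normal(0, 1.0, N)
    w *= observe_likelihood(pos, observed_pos)           # update
    w /= w.sum()
    idx = rng.choice(N, N, p=w)                          # resample
    pos, tempo, w = pos[idx], tempo[idx], np.ones(N) / N
    confidence = 1.0 / (1.0 + np.std(pos))               # posterior spread
    if confidence > 0.7:
        print(f"melody level: position ~ {pos.mean():.2f} beats")
    else:
        print(f"rhythm level: tempo ~ {tempo.mean():.0f} bpm")
```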
  • Hideki Kawahara, Masanori Morise, Toru Takahashi, Hideki Banno, Ryuichi Nisimura, Toshio Irino
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2 38-+ 2010  Refereed
    A systematic framework for non-periodic excitation source representation is proposed for high-quality speech manipulation systems such as TANDEM-STRAIGHT, which is basically a channel VOCODER. The proposed method consists of two subsystems for non-periodic components: a colored noise source and an event analyzer/generator. The colored noise source is represented by using a sigmoid model with non-linear level conversion. Two model parameters, boundary frequency and slope, are estimated based on pitch-range linear prediction combined with F0-adaptive temporal axis warping and on the original temporal axis. The event subsystem detects events based on the kurtosis of filtered speech signals. The proposed framework provides significant quality improvement for high-quality recorded speech materials.
  • Kyoko Matsuyama, Kazunori Komatani, Ryu Takeda, Toru Takahashi, Tetsuya Ogata, Hiroshi G. Okuno
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4 3050-3053 2010  Refereed
    In our barge-in-able spoken dialogue system, the user's behaviors, such as barge-in timing and utterance expressions, vary according to his/her characteristics and the situation. The system adapts to these behaviors by modeling them. We analyzed 1584 utterances collected by our systems for quiz and news-listing tasks and showed that the ratio of referential expressions used depends on the individual user and on the average length of the listed items. This tendency was incorporated as a prior probability into our method and improved the identification accuracy of the user's intended items.
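This line of work (see also the TRENDS IN APPLIED INTELLIGENT SYSTEMS paper above) fuses a Gamma-distributed barge-in timing likelihood with ASR-derived probabilities. The sketch below shows that fusion in its simplest Bayesian form; all numbers and the Gamma parameters are hypothetical, whereas the papers estimate them from real dialogue data.

```python
# Minimal sketch: combine a Gamma timing likelihood with ASR probabilities
# to score which listed item a barge-in utterance refers to.
import numpy as np
from scipy.stats import gamma

# ASR-derived probability that the utterance refers to each of 4 items.
p_asr = np.array([0.4, 0.3, 0.2, 0.1])

# Item i starts being read out at starts[i]; the user's barge-in lag after
# hearing the intended item is modeled as Gamma(k, theta).
starts = np.array([0.0, 2.0, 4.0, 6.0])
t_bargein = 4.8
lags = t_bargein - starts
p_time = np.where(lags > 0, gamma.pdf(lags, a=2.0, scale=0.5), 0.0)

p_item = p_asr * p_time
p_item /= p_item.sum()
print(np.round(p_item, 3))    # item 2 (0-indexed) becomes most probable
```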
  • Nobuhide Yamakawa, Tetsuro Kitahara, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4 2342-+ 2010  Refereed
    Research on environmental sound recognition has not developed as much as research on speech and musical signals. One reason is that the category of environmental sounds covers a broad range of acoustic natures. We classified them in order to explore suitable recognition techniques for each characteristic. We focus on impulsive sounds and their non-stationary features within and between analysis frames. We used matching pursuit as a framework for applying wavelet analysis to extract the temporal variation of audio features inside a frame. We also investigated the validity of modeling the decaying patterns of sounds using hidden Markov models (HMMs). Experimental results indicate that sounds with multiple impulsive components are recognized better using time-frequency analysis bases than by frequency-domain analysis. Classification of sound classes with a long and clear decaying pattern improves when HMMs with multiple hidden states are applied.
  • Hiromitsu Awano, Tetsuya Ogata, Shun Nishide, Toru Takahashi, Kazunori Komatani, Hiroshi G. Okuno
    IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS (SMC 2010) 2010  Refereed
    The objective of our study was to develop dynamic collaboration between a human and a robot. Most conventional studies have created pre-designed rule-based collaboration systems to determine the timing and behavior of robots to participate in tasks. Our aim is to introduce the confidence of the task as a criterion for robots to determine their timing and behavior. In this paper, we report the effectiveness of applying reproduction accuracy as a measure for quantitatively evaluating confidence in an object arrangement task. Our method comprises three phases. First, we obtain human-robot interaction data through the Wizard of Oz method. Second, the obtained data are trained using a neuro-dynamical system, namely, the Multiple Time-scales Recurrent Neural Network (MTRNN). Finally, the prediction error in MTRNN is applied as a confidence measure to determine the robot's behavior. The robot participated in the task when its confidence was high, while it just observed when its confidence was low. Training data were acquired using an actual robot platform, Hiro. The method was evaluated using a robot simulator. The results revealed that motion trajectories could be precisely reproduced with a high degree of confidence, demonstrating the effectiveness of the method.
  • Ryu Takeda, Kazuhiro Nakadai, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    IEEE/RSJ 2010 INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS 2010) 1949-1956 2010  Refereed
    This paper describes a speedup and performance improvement of multi-channel semi-blind ICA (MCSB-ICA) with parallel and resampling-based block-wise processing. MCSB-ICA is an integrated sound source separation method that accomplishes blind source separation, blind dereverberation, and echo cancellation. It enables robots to separate the user's speech from observed signals that include the robot's own speech, other speech, and their reverberations, without a priori information. The main problem in applying MCSB-ICA to robot audition is its high computational cost. We tackle this with multi-threaded programming, where the two main issues are (1) the design of the parallel processing and (2) incremental implementation. These are solved by (a) a multiple-stack-based parallel implementation and (b) resampling-based overlaps and block-wise separation. Experimental results proved that our method reduces the real-time factor to less than 0.5 with an eight-core CPU, and it improves automatic speech recognition performance by 2-10 points compared with a single-stack-based parallel implementation without the resampling technique.
  • Takeshi Mizumoto, Takuma Otsuka, Kazuhiro Nakadai, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    IEEE/RSJ 2010 INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS 2010) 1957-1963 2010  Refereed
    This paper presents a novel synchronization method for a human-robot ensemble using coupled oscillators. We define an ensemble as a synchronized performance produced through interactions between independent players. To attain a better synchronized performance, the robot should predict the human's behavior to reduce the difference between the human's and the robot's onset timings. Existing studies in such synchronization only adapt to onset intervals and thus need considerable time to synchronize. We use a coupled oscillator model to predict the human's behavior. Experimental results show that our method reduces the average onset-time error: with a metronome, a tempo-varying metronome, and a human drummer, errors are reduced by 38%, 10%, and 14% on average, respectively. These results show that predicting the human's behavior is effective for synchronized performance.
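The coupling idea above can be sketched as a phase oscillator that is nudged toward the human's detected onsets, Kuramoto-style, so the robot can predict the next onset rather than react to it. The coupling constants and the correction form are illustrative assumptions, not the paper's model.

```python
# Sketch: the robot's beat as a phase oscillator coupled to human onsets.
import math

phase = 0.0            # robot's oscillator phase (radians)
freq = 2.0 * math.pi   # rad/s, i.e. an initial guess of 60 bpm
K_phase, K_freq = 0.8, 0.3
dt = 0.01

def step(phase, freq, human_onset):
    """Advance the oscillator; couple to the human when an onset arrives."""
    phase += freq * dt
    if human_onset:
        # Human onsets should coincide with phase = 0 (mod 2*pi).
        err = math.sin(-phase)       # pulls phase toward 0 mod 2*pi
        phase += K_phase * err       # correct the phase now...
        freq += K_freq * err         # ...and the tempo for the future
    return phase, freq

# Prediction of the robot's next beat time from the current state:
time_to_next = (2.0 * math.pi - (phase % (2.0 * math.pi))) / freq
```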
  • Angelica Lim, Takeshi Mizumoto, Louis-Kenzo Cahier, Takuma Otsuka, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    IEEE/RSJ 2010 INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS 2010) 1964-1969 2010  Refereed
    Musicians often have the following problem: they have a music score that requires two or more players, but they have no one with whom to practice. Score-playing music robots exist, but they lack the adaptive ability to synchronize with fellow players' tempo variations; in other words, if the human speeds up their play, the robot should also increase its speed. Computer accompaniment systems, on the other hand, provide exactly this kind of adaptive ability. We present a first step towards giving these accompaniment abilities to a music robot. We introduce a new paradigm of beat tracking using two types of sensory input, visual and audio, using our own visual cue recognition system and state-of-the-art acoustic onset detection techniques. Preliminary experiments suggest that by coupling these two modalities, a robot accompanist can start and stop a performance in synchrony with a flutist, and detect tempo changes within half a second.
  • Toru Takahashi, Kazuhiro Nakadai, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    IEEE/RSJ 2010 INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS 2010) 964-969 2010  Refereed
    We describe the integration of preprocessing and automatic speech recognition based on Missing Feature Theory (MFT) to recognize highly interfered speech, such as speech from a desired speaker at a narrow angle to an interfering speaker. As a speech signal separated from a mixture of speech signals includes leakage from the other signals, recognition performance on the separated speech degrades. An important problem is estimating the leakage in each time-frequency component. Once the leakage is estimated, missing feature masks (MFMs) can be generated automatically by our method. A new weighted sigmoid function is introduced for our MFM generation method. An experiment shows that the word correct rate improves from 66% to 74% using our MFM generation method, tuned by a search-based approach in the parameter space.
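Soft missing-feature masks of the kind described above map a per-bin reliability estimate through a sigmoid into [0, 1]. The sketch below illustrates that mapping; the exact weighting of the paper's function is not reproduced, and the parameters are illustrative only.

```python
# Sketch: map an estimated per-bin signal-to-leakage ratio through a
# weighted sigmoid to obtain a soft missing-feature mask in [0, 1].
import numpy as np

def weighted_sigmoid_mask(slr_db, w=1.0, slope=0.5, threshold=0.0):
    """slr_db: estimated signal-to-leakage ratio per time-frequency bin."""
    return w / (1.0 + np.exp(-slope * (slr_db - threshold)))

slr = np.array([-10.0, -3.0, 0.0, 5.0, 15.0])
print(np.round(weighted_sigmoid_mask(slr), 2))  # low SLR -> unreliable bin
```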
  • Shun Nishide, Tetsuya Ogata, Jun Tani, Toru Takahashi, Kazunori Komatani, Hiroshi G. Okuno
    IEEE/RSJ 2010 INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS 2010) 2010  Refereed
    Predictability is an important factor in determining robot motions. This paper presents a model that generates robot motions based on reliable predictability, evaluated through a dynamics learning model which self-organizes object features. The model is composed of a dynamics learning module, namely a Recurrent Neural Network with Parametric Bias (RNNPB), and a hierarchical neural network as a feature extraction module. The model takes raw object images and robot motions as input. Through bi-directional training of the two models, object features that describe the object motion are self-organized at the output of the hierarchical neural network, which is linked to the input of the RNNPB. After training, the model searches for the robot motion with highly reliable predictability of the object motion. Experiments were performed with the robot's pushing motion on a variety of objects to generate sliding, falling over, bouncing, and rolling motions. For objects with a single motion possibility, the robot tended to generate motions that induce the object motion. For objects with two motion possibilities, the robot evenly generated motions that induce the two object motions.
  • Yasuharu Hirasawa, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno
    IEEE/RSJ 2010 INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS 2010) 450-457 2010  Refereed
    In real-world situations, a robot may often encounter an "under-determined" situation, where there are more sound sources than microphones. This paper presents a speech separation method using a new constraint on the harmonic structure for a simultaneous speech-recognition system under under-determined conditions. The requirements for a speech separation method in such a system are (1) the ability to handle a large number of talkers, and (2) reduction of distortion in acoustic features. Conventional methods use maximum likelihood estimation in sound source separation, which fulfills requirement (1); since it is a general approach, its performance in separating speech is limited. This paper presents a two-stage method to improve the separation. The first stage uses maximum likelihood estimation and extracts the harmonic structure, and the second stage exploits the harmonic structure as a new constraint to achieve requirement (2). We carried out an experiment that simulated three simultaneous utterances using impulse responses recorded by two microphones in an anechoic chamber. The experimental results revealed that our method improves speech recognition correctness by about four points.
  • 前澤 陽, 糸山 克寿, 高橋 徹, 尾形 哲也, 奥乃 博
    IPSJ SIG Technical Reports, Music and Computer (MUS) 2009(5) 1-6 Jul 22, 2009  
    We present a violin bowed-string sequence identification method that combines context-based rules with an audio-based bowed-string estimator. Applying the audio-based estimator and then correcting the parts of the sequence that violate the rules increases identification accuracy. In experiments on six musical phrases, accuracy increased on average by 5% (max. 8%) when using the same set of strings used for training, and by 7% on average (max. 15%) when using a different brand of strings.
  • 安良岡 直希, 糸山 克寿, 高橋 徹, 尾形 哲也, 奥乃 博
    IPSJ SIG Technical Reports, Music and Computer (MUS) 2009(10) 1-6 Jul 22, 2009  
    This paper presents a musical performance analysis-and-synthesis method that uses a residual model to suppress accompaniment and reverberation in the input. The residual model represents spectral components that the target part's score does not explain, which leads to efficient extraction of the target performance from accompanied and/or reverberant audio. The extraction is performed simultaneously with estimation of musical tone models representing both the harmonic and inharmonic sound of the performance, and the estimated tone models are used to synthesize a performance for a new, unseen score. Experiments showed that the spectral distance of one instrument part extracted from a polyphonic audio source improved by 35.0 points on average by incorporating the residual model, and that the method avoids degradation of analysis-synthesis quality for reverberant sources.
  • 水本 武志, 合原 一究, 高橋 徹, 尾形 哲也, 奥乃 博
    Proceedings of the 71st IPSJ National Convention 169-170 Mar 10, 2009  
  • 高橋 徹, 中臺 一博, 駒谷 和範, 尾形 哲也, 奥乃 博
    Proceedings of the 71st IPSJ National Convention 35-36 Mar 10, 2009  
  • 中川 達裕, 尾形 哲也, 谷 淳, 高橋 徹, 奥乃 博
    Proceedings of the 71st IPSJ National Convention 53-54 Mar 10, 2009  
  • 勝丸 真樹, 中野 幹生, 駒谷 和範, 成松 宏美, 船越 孝太郎, 辻野 広司, 高橋 徹, 尾形 哲也, 奥乃 博
    Proceedings of the 71st IPSJ National Convention 117-118 Mar 10, 2009  
  • 池田 智志, 駒谷 和範, 高橋 徹, 尾形 哲也, 奥乃 博
    Proceedings of the 71st IPSJ National Convention 121-122 Mar 10, 2009  

Books and Other Publications

 8

Presentations

 80

Teaching Experience (Courses)

 18

Professional Memberships

 6

Works

 1

Research Projects (Joint Research and Competitive Funding)

 15

Industrial Property Rights

 5

Research Themes

 1
  • Research Theme
    Human-robot interaction, speech communication, speech recognition, auditory scene understanding
    Keywords
    Microphone arrays, acoustic features, speech recognition, sound source localization, sound source separation
    Overview
    Working on the challenges of realizing natural dialogue between robots and humans in real-world environments