On-line Multi-Modal Speaker Diarization
The greatest drawback of multi-modal analysis is that it has to be performed off-line. Multi-modal models can have a very large number of parameters, and obtaining good estimates of all of them requires a sufficient amount of training data. In this work we show how to start with a robust single-modality model to infer the quantities of interest, and move gradually to a more complex multi-modal model as more information becomes available.
Learning in multi-modal streams
[Figure: Graphical representation of the CE workings]
Learning in multi-modal streams is a hard task. The stream is usually long (a 2-hour movie has more than 200,000 frames) and the state space is large. Furthermore, in many scenarios, such as speaker diarization, no training data is available. The technique most commonly used to learn under such circumstances is the expectation-maximization (EM) algorithm. In this paper we show how to search for a maximum-likelihood set of parameters in a dynamic Bayesian network (DBN) by means of the Cross Entropy (CE) method.
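As a rough illustration of the Cross Entropy idea, and not the procedure from the paper itself, the sketch below searches for maximum-likelihood parameters by repeatedly sampling candidate parameter vectors, keeping the best-scoring ones, and refitting the sampling distribution to them. Everything here is an assumption made for the example: a one-dimensional Gaussian stands in for the DBN, and the names `log_likelihood` and `cross_entropy_search`, the population size, and the elite fraction are hypothetical choices.

```python
import math
import random
import statistics


def log_likelihood(params, data):
    """Log-likelihood of i.i.d. data under a 1-D Gaussian (toy stand-in for a DBN)."""
    mean, log_std = params
    std = math.exp(log_std)  # optimise log(std) so that std stays positive
    const = -0.5 * math.log(2.0 * math.pi) - log_std
    return sum(const - 0.5 * ((x - mean) / std) ** 2 for x in data)


def cross_entropy_search(score, dim, iterations=40, population=100,
                         elite_frac=0.1, init_std=3.0, seed=0):
    """Maximise `score` over dim-dimensional vectors with a Gaussian sampler."""
    rng = random.Random(seed)
    mu = [0.0] * dim
    sigma = [init_std] * dim
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        # 1. Sample candidate parameter vectors from the current distribution.
        candidates = [[rng.gauss(mu[d], sigma[d]) for d in range(dim)]
                      for _ in range(population)]
        # 2. Keep the highest-scoring (elite) candidates.
        candidates.sort(key=score, reverse=True)
        elite = candidates[:n_elite]
        # 3. Refit the sampling distribution to the elite set.
        for d in range(dim):
            column = [c[d] for c in elite]
            mu[d] = statistics.fmean(column)
            sigma[d] = statistics.pstdev(column) + 1e-6  # keep a small floor
    return mu


# Synthetic data, generated only for this example.
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 0.5) for _ in range(200)]

mean_hat, log_std_hat = cross_entropy_search(
    lambda p: log_likelihood(p, data), dim=2)
print(mean_hat, math.exp(log_std_hat))
```

For this toy model the maximum-likelihood solution is known in closed form (the sample mean and the population standard deviation), so the printed estimates can be checked against `statistics.fmean(data)` and `statistics.pstdev(data)`. In a DBN no such closed form is generally available, which is where a sampling-based search of this kind becomes an alternative to EM.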
Audio-Visual Person Detection
[Figure: Scenario 1: Audio-Visual Person Detection. A video frame from a dialogue clip]
Multimedia content extraction is a task with a large variety of possible applications. State-of-the-art methods examine a specific dataset and use data-specific techniques. The dataset we explore here is a multimedia stream containing different people talking to the camera. This kind of data appears in many real-world situations, such as news anchors, talk shows, or interviews. The objective of this work is to create a framework that can detect how many people appear in the video data and how many people speak in the accompanying audio data and, most importantly, associate each person with the video and audio segments they generated.