Text- and content-based retrieval of video is a critical component of a DIVL, supporting automatic indexing and retrieval. This is one of the most active research areas in the United States; published papers and prototype software systems are too numerous to list here. Good overviews of recent efforts can be found in DLII n.d., ACM 1997a, ACM 1997b, AAAI 1997, and IEEE 1998. More detailed discussions of text-related search can be found in Chapter 5.

Several Japanese companies are also actively involved in this area. NTT researchers showed two demonstrations in the image area. The first reads the Japanese captions from TV broadcasts so that topic- or concept-based video retrieval can be accomplished. The key algorithmic steps are detection of frames that contain text, extraction of text regions, character segmentation, and character recognition. Details of these steps can be found in Kurakaka, Kuwano and Odaka (1997).

The second demonstration was of ExSight, a multimedia retrieval system using object-based image matching and keyword-based retrieval (Yamamuro et al. 1998). Unlike pixel- or impression-based approaches, object-based approaches such as ExSight search a large database using the content of objects extracted from the images. The steps involved include automatic object extraction, feature extraction (color, shape, etc.), and high-speed similarity matching. Query fusion (as a union of image objects) and high-speed browsing are provided as Java applets. Potential commercial applications include electronic commerce, digital museums (e.g., "show all the pictures of a boy with a dog"), and digital photo albums. Although primarily image-content driven, the system can accommodate keyword-based retrieval. A functional diagram of ExSight is shown in Figure 6.3.

Fig. 6.3. Functional diagram of ExSight (NTT).
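
The object-based retrieval steps described for ExSight (feature extraction, similarity matching, and query fusion as a union of image objects) can be illustrated with a minimal Python sketch. It is not NTT's implementation: the coarse color histogram, the histogram-intersection score, the max-over-objects fusion rule, and the names `color_histogram`, `similarity`, and `search` are all assumptions chosen for illustration, and object extraction is assumed to have already produced a list of pixels per object.

```python
from typing import Dict, List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def color_histogram(pixels: Sequence[Pixel], bins: int = 4) -> List[float]:
    """Feature extraction (assumed): a normalized, coarsely quantized RGB histogram."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Map each channel value 0..255 onto a bin index 0..bins-1.
        ri, gi, bi = r * bins // 256, g * bins // 256, b * bins // 256
        hist[ri * bins * bins + gi * bins + bi] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Histogram intersection: 1.0 for identical normalized histograms, 0.0 for disjoint ones."""
    return sum(min(x, y) for x, y in zip(a, b))


def search(
    query_objects: Sequence[Sequence[float]],
    database: Dict[str, List[List[float]]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank images by their best object match against any query object.

    Query fusion is modeled here (an assumption) as a union: an image is a
    good hit if any of its objects resembles any of the query objects.
    """
    scored = []
    for image_id, objects in database.items():
        best = max(
            (similarity(q, o) for q in query_objects for o in objects),
            default=0.0,
        )
        scored.append((image_id, best))
    # Highest score first; ties broken by image id for a stable order.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


if __name__ == "__main__":
    red = color_histogram([(250, 10, 10)] * 20)
    blue = color_histogram([(10, 10, 250)] * 20)
    green = color_histogram([(10, 250, 10)] * 20)
    db = {"img_red": [red], "img_blue": [blue], "img_green": [green]}
    # A fused query made of a red object and a blue object.
    for image_id, score in search([red, blue], db):
        print(image_id, round(score, 3))
```

A production system would replace the exhaustive loop with an index structure to achieve the high-speed matching the text mentions; the sketch only shows how per-object features and a union-style fused query fit together.
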
When audio books and video are collected and bound as digital objects, it is critical to provide user-friendly interfaces to access them. In the CyberShelf project, books created from HTML documents are accessible using a book metaphor description language. Another interesting demonstration was an image mosaicking system that produces a panoramic view from a sequence of translating images. User-friendly interfaces to the mosaicking algorithms have been provided. Details of the mosaicking algorithms can be found in Akutsu et al. (1995) and Taniguchi et al. (1997).
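
The idea of building a panorama from a sequence of translating images can be sketched as follows. This is a simplified illustration, not the algorithm of the cited papers: it assumes grayscale frames stored as lists of rows, a camera that pans only to the right by a whole number of pixels, and a brute-force search for the shift with the lowest mean squared difference. The names `estimate_shift` and `mosaic` are hypothetical.

```python
from typing import List

Image = List[List[int]]  # grayscale image: rows of equal length


def estimate_shift(prev: Image, curr: Image, max_shift: int) -> int:
    """Return the shift dx (0..max_shift) for which curr[y][x] best matches prev[y][x + dx]."""
    width, height = len(prev[0]), len(prev)
    best_dx, best_err = 0, float("inf")
    for dx in range(max_shift + 1):
        overlap = width - dx
        if overlap <= 0:
            break
        # Mean squared difference over the overlapping columns.
        err = sum(
            (prev_row[x + dx] - curr_row[x]) ** 2
            for prev_row, curr_row in zip(prev, curr)
            for x in range(overlap)
        ) / (overlap * height)
        if err < best_err:
            best_dx, best_err = dx, err
    return best_dx


def mosaic(frames: List[Image], max_shift: int) -> Image:
    """Append to the panorama the columns each frame newly reveals on its right edge."""
    panorama = [row[:] for row in frames[0]]
    for prev, curr in zip(frames, frames[1:]):
        dx = estimate_shift(prev, curr, max_shift)
        if dx > 0:
            for pan_row, row in zip(panorama, curr):
                pan_row.extend(row[-dx:])
    return panorama


if __name__ == "__main__":
    # Synthetic scene, viewed through a 6-pixel-wide window at offsets 0, 2, and 5.
    scene = [[(x * x + 7 * y) % 97 for x in range(12)] for y in range(3)]
    frames = [[row[o:o + 6] for row in scene] for o in (0, 2, 5)]
    panorama = mosaic(frames, max_shift=4)
    print(len(panorama[0]), "columns recovered")
```

Real footage would also call for sub-pixel and vertical motion estimation and for blending across seams, none of which is attempted here.
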