Text Mining for Metadata Extraction and Semantic Retrieval

Project summary

Because full-text search lacks semantics, its capabilities are limited, and adding metadata is a widespread remedy. Metadata is additional structured information, readable by both humans and machines, that is added to text documents in a process called annotation. Metadata has been used in document and knowledge management systems for decades; however, only recently has the emergence of the Semantic Web vision led to standardized tools and techniques for metadata management (e.g. RDF, Dublin Core).

However, we still see two major problems with the use of metadata in knowledge and document management systems, as well as in a Semantic Web context. First, the standard approach to searching metadata is an exact query based on a Boolean search constraint specified by the user (e.g. title = x AND author = y). There is no fuzziness or ranking as known from classic Information Retrieval models. We argue that fuzzy ranking is needed for metadata search as well. A metadata model (and a corresponding ontology) can be expected to become quite complex, so it is hard for users to build search queries that "perfectly" represent their information need. In addition, metadata quality is questionable because it depends on voluntary annotation by humans. This leads to the second major problem: the lack of user acceptance due to the need for manual annotation of text documents. Users will only be willing to perform this extra work if they see a direct benefit from doing so. However, a significant improvement in meeting users' information needs will only be reached once a critical mass of annotated documents exists.
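The contrast between an exact Boolean constraint and a ranked, fuzzy match can be illustrated with a minimal sketch. It is not the project's method: the records, the field names, and the use of Python's standard `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical metadata records; field names and values are illustrative only.
DOCUMENTS = [
    {"title": "Text Mining for Metadata Extraction", "author": "J. Smith"},
    {"title": "Text Minning and Metadata Extraction", "author": "J. Smyth"},
    {"title": "Fuzzy Queries on RDF Metadata", "author": "A. Jones"},
]


def boolean_search(docs, query):
    """Return only the documents whose fields all equal the query values (AND)."""
    return [d for d in docs if all(d.get(f) == v for f, v in query.items())]


def field_similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_search(docs, query):
    """Rank every document by its mean per-field similarity to the query."""
    scored = []
    for d in docs:
        total = sum(field_similarity(d.get(f, ""), v) for f, v in query.items())
        scored.append((total / len(query), d))
    # Highest score first; no document is discarded outright.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# The query mixes a correct title with a slightly different author spelling.
query = {"title": "Text Mining for Metadata Extraction", "author": "J. Smyth"}

exact_hits = boolean_search(DOCUMENTS, query)  # empty: no record matches both fields
ranked = fuzzy_search(DOCUMENTS, query)        # all records, best candidates first

for score, doc in ranked:
    print(f"{score:.3f}  {doc['title']}  ({doc['author']})")
```

With this query the Boolean search returns nothing, because no record matches both fields exactly, while the ranked search still places the two near-matching records ahead of the unrelated one.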

Motivated by these problems, the project aims to develop and evaluate new tools and techniques for its two goals: fuzzy metadata-based search and automated metadata extraction.

For metadata management, Semantic Web technologies such as the Resource Description Framework (RDF) and the Dublin Core metadata standard will be applied. The project will draw on the complementary experience of the partner universities in Košice (text mining; a Java-based system supporting pre-processing, indexing and mining of text document collections) and Regensburg (Information Retrieval using fuzzy queries on RDF metadata).
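To give a sense of what Dublin Core metadata expressed in RDF looks like, the following sketch writes a few metadata fields as N-Triples statements. The document URI and the field values are hypothetical; only the Dublin Core element namespace and the element names `title`, `creator` and `language` come from the published standard. A real system would more likely use an RDF library than hand-built strings.

```python
# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical document URI and metadata values, for illustration only.
DOC_URI = "http://example.org/docs/report-42"
metadata = {
    "title": "Text Mining for Metadata Extraction",
    "creator": "J. Smith",
    "language": "en",
}


def escape_literal(value):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def to_ntriples(subject, fields):
    """Serialise field/value pairs as one N-Triples line per Dublin Core element."""
    return [
        f'<{subject}> <{DC}{name}> "{escape_literal(value)}" .'
        for name, value in fields.items()
    ]


lines = to_ntriples(DOC_URI, metadata)
print("\n".join(lines))
```

Each output line is a subject, predicate, object statement about the document, which is the form that a metadata search would query.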

Project participants

Technical University of Košice
University of Regensburg