Text Mining for Metadata Extraction and Semantic Retrieval
Project summary
Because full-text search lacks semantics, its capabilities are limited, and introducing metadata is a widespread remedy. Metadata is additional structured (human- and machine-readable) information that is added to text documents in a process called annotation. Metadata has been used in document and knowledge management systems for decades; however, only recently has the emergence of the vision of the Semantic Web led to standardized tools and techniques for metadata management (e.g. RDF, Dublin Core).
Nevertheless, we still see two major problems with the use of metadata in knowledge and document management systems, as well as in a Semantic Web context. First, the standard approach to searching metadata is to perform an exact query based on a Boolean search constraint specified by the user (e.g. title = x AND author = y). There is no fuzziness or ranking as known from classic Information Retrieval models. We argue that fuzzy ranking is needed for metadata searching as well. A metadata model (and a corresponding ontology) can be expected to become quite complex, so it is hard for users to build search queries that "perfectly" represent their information need. In addition, metadata quality is questionable because it depends on voluntary annotation by humans. This leads to the second major problem: the lack of user acceptance caused by the need to annotate text documents manually. Users will only be willing to perform this extra work if they see a direct benefit from doing so. However, a significant improvement in fulfilling the users' information needs will only be reached once a certain critical mass of annotated documents exists.
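The difference between an exact Boolean constraint and a graded match can be illustrated with a minimal Python sketch. The documents, the field values, and the use of a plain string similarity from the standard library are assumptions made for illustration only; they are not the project's retrieval model.

```python
from difflib import SequenceMatcher

# Hypothetical annotated documents; the fields follow the "title"/"author" example.
DOCS = [
    {"id": "d1", "title": "Text Mining for Metadata Extraction", "author": "Paralic"},
    {"id": "d2", "title": "Fuzzy Queries on RDF Metadata", "author": "Priebe"},
    {"id": "d3", "title": "Text Mining Methods", "author": "Paralič"},
]


def boolean_search(docs, query):
    """Exact match: a document qualifies only if every field equals the query value."""
    return [d["id"] for d in docs if all(d.get(f) == v for f, v in query.items())]


def field_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]; 1.0 means identical values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def ranked_search(docs, query):
    """Graded match: average per-field similarity, best-scoring document first."""
    scored = []
    for d in docs:
        total = sum(field_similarity(d.get(f, ""), v) for f, v in query.items())
        scored.append((d["id"], total / len(query)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# The user spells the author without the diacritic, so no document matches exactly.
query = {"title": "Text Mining Methods", "author": "Paralic"}
exact = boolean_search(DOCS, query)
ranked = ranked_search(DOCS, query)

print("Boolean result:", exact)
for doc_id, score in ranked:
    print(f"{doc_id}: {score:.3f}")
```

With this data the Boolean query returns nothing because of a single spelling difference in the author field, while the graded variant still places the intended document at the top of the list.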
Motivated by these problems, the project aims to develop and evaluate new tools and techniques for fuzzy metadata-based search and automated metadata extraction. In particular, the two goals are:
- 1. Effective support of metadata-based Information Retrieval using the (extended) Vector Space Model (VSM).
- 2. Extraction of metadata from text collections using Text Mining techniques such as text classification or clustering.
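As an illustration of the first goal, the following Python sketch ranks metadata records against a query with a basic Vector Space Model (TF-IDF weights and cosine similarity). It is a simplified sketch under stated assumptions: the records, the field-qualified term scheme (e.g. `title:fuzzy`), and the smoothed IDF formula are illustrative choices, and the extended VSM named in the goal is not reproduced here.

```python
import math
from collections import Counter

# Hypothetical metadata records; each field value is treated as a bag of terms.
DOCS = {
    "d1": {"title": "text mining library", "subject": "text mining retrieval"},
    "d2": {"title": "fuzzy queries on rdf metadata", "subject": "information retrieval"},
    "d3": {"title": "clustering of fuzzy concepts", "subject": "text clustering"},
}


def terms(record):
    """Flatten a record into field-qualified terms such as 'title:mining'."""
    return [f"{field}:{tok}" for field, value in record.items()
            for tok in value.lower().split()]


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the collection."""
    n = len(docs)
    df = Counter(t for rec in docs.values() for t in set(terms(rec)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def vectorize(record, idf):
    """Sparse TF-IDF vector; terms unknown to the collection are dropped."""
    tf = Counter(terms(record))
    return {t: f * idf[t] for t, f in tf.items() if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(docs, query):
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    idf = build_idf(docs)
    q = vectorize(query, idf)
    scores = {doc_id: cosine(q, vectorize(rec, idf)) for doc_id, rec in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank(DOCS, {"title": "fuzzy metadata", "subject": "retrieval"})
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Qualifying each term with its field keeps a match in `title` distinct from a match in `subject`, so a record that satisfies the query only partially still receives a non-zero score instead of being excluded.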
For metadata management, Semantic Web technologies such as the Resource Description Framework (RDF) and the Dublin Core metadata standard will be applied. The project will draw on the complementary experience of the partner universities in Košice (Text Mining; a Java-based system supporting pre-processing, indexing and mining of text document collections) and Regensburg (Information Retrieval using fuzzy queries on RDF metadata).
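To make the combination of RDF and Dublin Core concrete, the Python sketch below serializes one metadata record as RDF triples in N-Triples syntax, using the Dublin Core element namespace. The document URI and the helper names are hypothetical, and the record is taken from the publication list of this page purely as sample data; the project's actual metadata schema is not specified here.

```python
# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC = "http://purl.org/dc/elements/1.1/"


def escape_literal(value):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(subject_uri, metadata):
    """Emit one triple per (element, value) pair; list values yield several triples."""
    lines = []
    for element, values in metadata.items():
        for value in values if isinstance(values, list) else [values]:
            lines.append(f'<{subject_uri}> <{DC}{element}> "{escape_literal(value)}" .')
    return "\n".join(lines)


# Hypothetical document URI; the field names are Dublin Core elements.
record = {
    "title": "Java library for support of text mining and retrieval",
    "creator": ["Bednár, P.", "Butka, P.", "Paralič, J."],
    "date": "2005",
}
output = to_ntriples("http://example.org/docs/42", record)
print(output)
```

Each output line is a subject, predicate and literal object terminated by a period, which is the kind of statement an RDF store can index and query by field.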
Project participants
- Technical University of Košice
- University of Regensburg
Publications
- Bednár, P., Butka, P. and Paralič, J. (2005): Java library for support of text mining and retrieval. Proc. of Znalosti 2005, Stará Lesná, High Tatras, 2005, pp. 162-169
- Priebe, T., Kiss, C. and Kolter, J. (2005): Semiautomatische Annotation von Textdokumenten mit semantischen Metadaten. Proc. 7. Internationale Tagung Wirtschaftsinformatik (WI 2005), Bamberg, Germany, February 2005
- Bednár, P. (2005): Word sense desambiguation using Wordnet. Proc. of the 6th Workshop on Data Analysis (WDA 2005), Abaújszántó, Hungary, June 2005, pp. 70-74
- Butka, P. (2005): Use of ontologies for information retrieval in the semantic web environment. Proc. of the 6th Workshop on Data Analysis (WDA 2005), Abaújszántó, Hungary, June 2005, pp. 10-15
- Sarnovský, M. (2005): Integration of text mining services in the GRID-Miner system. Proc. of the 6th Workshop on Data Analysis (WDA 2005), Abaújszántó, Hungary, June 2005, pp. 64-69
- Paralič, J. and Smatana, P. (2005): Transformation of Free-text Electronic Health Records for Efficient Information Retrieval and Support of Knowledge Discovery. Proc. of the 16th International Conference on Information and Intelligent Systems (IIS 2005), Varaždin, Croatia, September 2005, pp. 139-144
- Butka, P. (2005): Aplikácia zhlukovania fuzzy konceptov v doméne textových dokumentov. Proc. of the ITAT 2005, Information Technologies - Applications and Theory, Workshop on Theory and Practice of Information Technologies, Račkova Dolina, Slovakia, September 2005, pp. 31-40