Data Compression Approach to monolingual GIRT Task: an agnostic point of view

Alderuccio D.
Member of the Collaboration Group
;
Bordoni L.
Project Administration
;
2003-01-01

Abstract

In this paper we present a data-compression-oriented approach to the information retrieval task on the GIRT scientific collection. For this purpose we use a recently proposed general scheme for context recognition and context classification of strings of characters (in particular texts) or other coded information. The method is based on data-compression techniques, and its key point is the computation of a suitable measure of remoteness between two strings of characters. This measure reflects only the distance in information between the two strings, i.e. the differences between the syntactic and structural elements of the sequences. The question we address is whether this information-theoretic measure of remoteness between two sequences can account for their semantic distance. We have focused in particular on the monolingual GIRT tasks for German and English, and we present the results here. It is worth stressing the generality and versatility of our information-theoretic method: it applies to any kind of corpus of character strings, independent of the type of coding behind them. For texts, it is therefore language independent, since it does not rely on any linguistic knowledge.
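
As a rough illustration of what a compression-based measure of remoteness between two strings might look like, the Python sketch below may help. It is not the authors' implementation: the use of `zlib`, the function names (`compressed_size`, `cross_cost`, `remoteness`), the `probe_len` parameter and the symmetrised formula are all assumptions chosen for the example. The idea is to ask how many extra compressed bytes a short piece of one string costs when it is appended to the other string, compared with appending it to the rest of its own string.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of `data` after zlib (DEFLATE) compression at level 9."""
    return len(zlib.compress(data, 9))


def cross_cost(reference: bytes, probe: bytes) -> int:
    """Extra compressed bytes needed for `probe` once `reference` has been seen."""
    return compressed_size(reference + probe) - compressed_size(reference)


def remoteness(a: bytes, b: bytes, probe_len: int = 256) -> float:
    """Hypothetical symmetrised remoteness between two byte strings.

    Each string is split into a body and a short trailing probe. For each
    direction we compare the cost of encoding the probe after the *other*
    string's body with the cost of encoding it after its *own* body, and
    normalise by the probe length. The two directions are then averaged.
    """
    if len(a) <= probe_len or len(b) <= probe_len:
        raise ValueError("both strings must be longer than probe_len")

    body_a, probe_a = a[:-probe_len], a[-probe_len:]
    body_b, probe_b = b[:-probe_len], b[-probe_len:]

    # Cost of describing a piece of b using statistics learned from a, minus
    # the cost of describing it using statistics learned from b itself.
    a_to_b = (cross_cost(body_a, probe_b) - cross_cost(body_b, probe_b)) / probe_len
    # Same quantity in the opposite direction.
    b_to_a = (cross_cost(body_b, probe_a) - cross_cost(body_a, probe_a)) / probe_len

    return (a_to_b + b_to_a) / 2
```

Under these assumptions, a document could be ranked against a query by sorting candidates by increasing `remoteness`. Note that zlib only looks back over a 32 KB window, so for long texts a compressor with a larger window, or truncated reference bodies, would be needed; this is a limitation of the sketch rather than a statement about the method used in the paper.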

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12079/60897