Data Compression Approach to monolingual GIRT Task: an agnostic point of view
Alderuccio D. (Member of the Collaboration Group); Bordoni L. (Project Administration)
2003-01-01
Abstract
In this paper we present a data-compression-oriented approach to the information retrieval task on the GIRT scientific collection. For this purpose we use a recently proposed general scheme for context recognition and context classification of strings of characters (in particular texts) or other coded information. The method is based on data-compression techniques, and its key point is the computation of a suitable measure of remoteness between two strings of characters. This measure reflects only the distance in information between the two strings, i.e. the differences between the syntactic and structural elements of the sequences. The question we address is whether this information-theoretic measure of remoteness between two sequences can account for their semantic distance. We have focused in particular on the monolingual GIRT tasks for German and English, and we present the results here. It is worth stressing the generality and versatility of our information-theoretic method: it applies to any kind of corpus of character strings, independent of the type of coding behind them. For texts, it is therefore language independent, since it does not rely on any linguistic knowledge.
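The abstract does not spell out how the measure of remoteness is computed, so the following is only an illustrative sketch of the general idea of a compression-based distance, not the authors' actual scheme. It assumes the well-known normalized compression distance as a stand-in and uses `zlib` as the compressor. The function names `compressed_size`, `remoteness`, and `rank` are hypothetical and chosen for this example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def remoteness(a: str, b: str) -> float:
    """Normalized compression distance between two strings.

    Assumption: this is a stand-in for the paper's measure, which the
    abstract does not define. Smaller values mean the compressor finds
    more shared structure between the two strings.
    """
    xa, xb = a.encode("utf-8"), b.encode("utf-8")
    ca, cb = compressed_size(xa), compressed_size(xb)
    cab = compressed_size(xa + xb)
    return (cab - min(ca, cb)) / max(ca, cb)


def rank(query: str, documents: list[str]) -> list[str]:
    """Order documents from least to most remote with respect to the query."""
    return sorted(documents, key=lambda doc: remoteness(query, doc))
```

A retrieval run in this spirit would call `rank(query, documents)` and return the top entries. No linguistic preprocessing is involved, which matches the language-independent character of the approach described above; whether such a purely structural distance tracks semantic distance is exactly the question the paper examines.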