DBLab School of Computer and Electrical Engineering KDBSL NTUA
Thursday, July 02, 2020
Title: Scalable Clustering of Categorical Data and Applications
Author: Periklis Andritsos
Description

Clustering is a problem of great practical importance in numerous
applications. The problem of clustering becomes more challenging when
the data is categorical, that is, when there is no inherent distance
measure between data values. In this talk, we introduce LIMBO, a
scalable hierarchical categorical clustering algorithm that uses an
intuitive information-theoretic distance measure for categorical tuples
and values. When clustering values, LIMBO can give useful hints about
duplication and errors that may exist in a data set. As a
hierarchical algorithm, LIMBO has the advantage that it can produce
clusterings of different sizes in a single execution, working within a
memory-bounded summary model of the data. We present results from our
experimental evaluation of LIMBO, which show an increase in efficiency
without significant loss in the quality of the produced clusterings. We
move on to show how the algorithm can be used to produce valid and
useful clusterings of large software systems. In this case, LIMBO is
applied in the presence of both structural and non-structural
information about the software systems and, thus, allows for an
evaluation of how useful each kind of information is in understanding
those systems. Finally, we conclude the talk with a set of open
research challenges.
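
The abstract does not spell out the distance measure. As a rough sketch
only, one common information-theoretic choice is the information lost
when two clusters are merged, which works out to a weighted
Jensen-Shannon divergence between their value distributions. The code
below assumes that choice; the function names, the uniform weighting of
tuples, and the three example tuples are all hypothetical and are not
taken from the talk.

```python
import math
from collections import Counter


def value_distribution(tuple_values):
    """Distribution over the attribute values of one tuple (assumed uniform)."""
    counts = Counter(tuple_values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; q must cover the support of p."""
    return sum(pv * math.log2(pv / q[v]) for v, pv in p.items() if pv > 0)


def merge_cost(p1, w1, p2, w2):
    """Assumed distance: information loss from merging two clusters.

    p1, p2 are value distributions; w1, w2 are the cluster probabilities.
    The result is (w1 + w2) times the weighted Jensen-Shannon divergence.
    """
    w = w1 + w2
    a, b = w1 / w, w2 / w
    keys = set(p1) | set(p2)
    mix = {v: a * p1.get(v, 0.0) + b * p2.get(v, 0.0) for v in keys}
    return w * (a * kl_divergence(p1, mix) + b * kl_divergence(p2, mix))


# Hypothetical categorical tuples: (director, actor, genre).
t1 = ("director=Scorsese", "actor=De Niro", "genre=Crime")
t2 = ("director=Coppola", "actor=De Niro", "genre=Crime")
t3 = ("director=Hitchcock", "actor=Stewart", "genre=Thriller")

p1, p2, p3 = (value_distribution(t) for t in (t1, t2, t3))
weight = 1 / 3  # each of the three tuples is assumed equally likely

cost_similar = merge_cost(p1, weight, p2, weight)    # tuples share two values
cost_different = merge_cost(p1, weight, p3, weight)  # tuples share no values
print(cost_similar, cost_different)
```

Under these assumptions, tuples that share attribute values are cheaper
to merge than tuples that share none, so an agglomerative procedure
that always merges the cheapest pair would group t1 and t2 first.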
