Text clustering, or automatic grouping of texts, divides a set of texts into groups, so-called clusters. The goal is to produce clusters such that texts in the same cluster are more similar in content than texts from different clusters.
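As a rough illustration of the idea, the sketch below represents each text as a bag of words, measures similarity with the cosine of the angle between the word-count vectors, and groups texts greedily. This is not the method used in our work; the stop word list, the threshold value, the example texts and the function names are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Illustrative stop word list (an assumption, not taken from the project).
STOPWORDS = {"the", "by", "a"}

def bag_of_words(text):
    """Lowercase the text and count its words, skipping stop words."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def cluster(texts, threshold=0.3):
    """Greedy single-pass clustering.

    Each text joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `texts`.
    """
    vectors = [bag_of_words(t) for t in texts]
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_similarity(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

texts = [
    "the football team won the match",
    "the team lost the football match",
    "the bank raised the interest rate",
    "interest rate cut by the central bank",
]
print(cluster(texts))
```

With these four texts, the two football texts end up in one cluster and the two bank texts in another. Real clustering algorithms are more elaborate, but they rest on the same kind of representation and similarity measure.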
Many sets of texts are partitioned manually as a matter of routine, for instance in libraries and in newspapers (the sections of the paper). These partitions are static and not always suitable. A new partition may shed new light on a set of texts.
The result of a text clustering depends on how the texts are represented. We have investigated how some aspects of the Swedish language affect the result. In connection with this we have also studied the evaluation of text clustering. It is very hard to define what a good partition of a set of texts is, and hence also very hard to measure the quality of one.
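To make the measurement problem concrete, here is a minimal sketch of purity, one common way to compare a clustering against a manual partition. It is not necessarily a measure used in our work; the function name, the category labels and the example data are assumptions made for the illustration.

```python
from collections import Counter

def purity(clusters, labels):
    """Fraction of texts that belong to the majority manual category
    of their cluster.

    `clusters` is a list of clusters, each a list of text indices;
    `labels[i]` is the manually assigned category of text i.
    """
    total = sum(len(members) for members in clusters)
    if total == 0:
        return 0.0
    majority = sum(
        Counter(labels[i] for i in members).most_common(1)[0][1]
        for members in clusters
        if members
    )
    return majority / total

# Hypothetical manual categories for five texts.
labels = ["sport", "sport", "economy", "economy", "economy"]

two_clusters = [[0, 1, 2], [3, 4]]
singletons = [[0], [1], [2], [3], [4]]

print(purity(two_clusters, labels))
print(purity(singletons, labels))
```

The second call shows one weakness: putting every text in its own cluster gives perfect purity, although such a partition is useless. A measure like this also assumes that the manual partition is the right one, which is exactly what a new partition may call into question.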
We believe text clustering will become an important tool for exploring and analysing open text answers in questionnaires. The information in free text answers is almost never used, since manual analysis is too hard and expensive. Automatic partitions make it easier to find connections and similarities among the answers. We cooperate with the Department of Medical Epidemiology and Biostatistics at Karolinska Institutet (The Swedish Medical University) to investigate these possibilities.