Laboration 1 - The Infomat GUI and Basic Clustering
Start and Preprocessing
Start the class InfomatGUI using your programming environment or from a command prompt.
Open the file menu and choose "Open tokenFile".
There is a pattern in the picture. Why? What is the order of the texts (rows) and the words (columns)?
The preprocessing is not very good - there are a lot of bad "words". There are several horizontal lines. What are they? Use the Pixel View to find out. This is the main tool for getting textual information out of Infomat. Read about it in the manual. (It opens via the Views menu. Don't forget to choose the Pixel Selection in the Image menu or press the corresponding toolbar button.) Read some of the strange texts by clicking on their buttons in the lists of the Pixel View window.
Remove the corresponding words, using the Remove columns option in the Image menu (or the button in the toolbar). Then Purge the matrix using the option in the Tools menu. The columns you removed in the picture are now removed from the matrix as well.
Save the Matrix
Save the matrix using the Save matrix option in the File menu.
Call the file(s)
The Stoplist option in the Tools menu allows you to remove stop words. It is divided into three sections. The leftmost contains some properties. Look at them, then click the "Whole stoplist to IO" button in the rightmost section. You now have only the words along the columns that agree with the properties. Sift through the list. If there are any words you do not want to remove, select them and choose "Remove" in the "Select" drop-down menu, then press "Apply" to its right. You can read more about the Stoplist in the manual.
When you are done, press the Apply button. The stoplist tool automatically purges the matrix.
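To make the effect of a stoplist concrete, here is a minimal sketch in Python of dropping stop-word columns from a small text-by-word matrix. It is not Infomat's code. The function name `remove_columns`, the example words and the counts are all made up for illustration.

```python
def remove_columns(matrix, words, stoplist):
    """Drop every column whose word appears in the stoplist.

    matrix is a list of rows (texts), words labels the columns.
    All names here are hypothetical, not part of Infomat.
    """
    keep = [j for j, w in enumerate(words) if w not in stoplist]
    new_words = [words[j] for j in keep]
    new_matrix = [[row[j] for j in keep] for row in matrix]
    return new_matrix, new_words


# Two texts (rows) and four words (columns), with invented counts.
words = ["the", "clustering", "of", "texts"]
matrix = [
    [3, 1, 2, 0],
    [5, 0, 1, 2],
]
stoplist = {"the", "of"}

filtered, kept_words = remove_columns(matrix, words, stoplist)
```

After the call only the columns for the remaining words are left, which corresponds to what you see in the picture once the matrix has been purged.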
There is a stoplist
You can also remove objects using the Pixel View window. Try it! Don't forget to purge the matrix when you are done.
There is one more way to remove uninteresting stuff - the Filter Matrix option in the Algorithms menu. Look at the Properties and alter them. Use Filter Matrix last during removal. It allows you to remove texts and words that no longer have any matrix elements corresponding to them. The filtering is done on the current matrix, so if you apply it again it may remove even more.
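The following Python sketch illustrates why applying a filter a second time may remove more. It is not Infomat's implementation. The function `filter_once` and its two thresholds are assumptions chosen for the example, and the real Properties may differ.

```python
def filter_once(matrix, min_texts_per_word=2, min_words_per_text=2):
    """One filtering pass over a text-by-word matrix (hypothetical thresholds).

    First drops words that occur in too few texts, then drops texts
    that are left with too few distinct words.
    """
    n_cols = len(matrix[0]) if matrix else 0
    # Number of texts each word occurs in.
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_cols)]
    keep_cols = [j for j in range(n_cols) if doc_freq[j] >= min_texts_per_word]
    reduced = [[row[j] for j in keep_cols] for row in matrix]
    # Keep only texts that still contain enough distinct words.
    return [row for row in reduced
            if sum(1 for v in row if v > 0) >= min_words_per_text]


matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
]

first_pass = filter_once(matrix)       # drops the last word, then the third text
second_pass = filter_once(first_pass)  # the third word now occurs in one text only
```

The first pass removes a text, which lowers the number of texts another word occurs in. That word then falls below the threshold in the second pass.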
When you are down to around 10 000 good words you can go on.
When you have removed what is necessary it is time to weight the matrix, using the Weight Matrix option in the Algorithms menu. Look at the Properties before you press the Apply button. You can use the default settings.
Remember the following order: remove, purge, weight. You have to apply these actions, in this order, for them to affect the following algorithms.
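As an illustration of what weighting a matrix can mean, here is a small Python sketch of tf-idf weighting. This is an assumption: the text above does not say which weighting scheme the default settings use, so treat this as one common example, not as a description of Infomat. The function name `tfidf` and the counts are made up.

```python
import math


def tfidf(matrix):
    """Weight raw counts by the inverse document frequency of each word.

    A generic tf-idf variant (tf * log(N / df)); Infomat's actual
    weighting may be different.
    """
    n_texts = len(matrix)
    n_words = len(matrix[0]) if matrix else 0
    # Number of texts each word occurs in.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_words)]
    idf = [math.log(n_texts / d) if d > 0 else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_words)] for row in matrix]


# Two texts and three words, with invented counts.
counts = [
    [2, 1, 0],
    [1, 0, 3],
]
weighted = tfidf(counts)
```

A word that occurs in every text gets the weight zero, so it no longer influences the clustering. This is also why removal should come before weighting: the weights depend on which texts and words are left in the matrix.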
Save a few matrices with different amounts of preprocessing: with/without stoplist, filtering, removal of different amounts of bad "words", etc. You can use these in laboration 2. Or you can go back and do this later (for instance when you do laboration 2).
Now you are ready to cluster! Open the Clustering Algorithms window from the Algorithms menu. It is set to rows and the K-Means algorithm is chosen. Look at its Properties. Then press the Apply button.
The texts (rows) are grouped into five clusters along the vertical axis. The columns have the same order as before. Press the two buttons to the left in the toolbar. The clusters are now separated by horizontal lines. You can see a difference in the distributions of words between the text clusters. Using the Pixel View you can try to decide what the clusters are about.
What is the order of the texts within the clusters? (Look at the K-Means Properties.)
To understand the cluster content better you can use the Relative Clusterer in the Clustering Algorithm window. Choose Columns! Explore the words in the corresponding word cluster for each text cluster. Do you get a better idea of what the clusters are about?
Don't forget that K-Means can give different results each time.
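The reason for the varying results is the random starting point. Below is a minimal K-Means sketch in plain Python, not Infomat's implementation. It assumes squared Euclidean distance and random rows as initial centroids, which may differ from the options in the K-Means Properties. The `seed` parameter is a hypothetical addition that makes a run repeatable.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=20, seed=None):
    """Return one cluster label per row.

    The initial centroids are k randomly chosen rows, so different
    seeds can lead to different clusterings on real data.
    """
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    assignment = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [rows[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Four tiny "texts" over two words, with invented weights.
rows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = kmeans(rows, k=2, seed=0)
```

On this tiny example every starting point leads to the same two groups, although the cluster numbers may be swapped. On a real text matrix the groups themselves can change from run to run.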
Grouping Edit Window
You can also look at the clusterings using the Grouping Edit and Group Edit windows. You open the former by pressing the E button in the Grouping panel. Here you can rename the clusters, reorder them and save the clustering to a file. Saved groupings can also be loaded again. Try it! Matrices are not saved with their groupings. You have to save all groupings separately.
Group Edit Window
In the Grouping Edit window there is an E button for each group. Press one and the corresponding Group Edit window opens. Here you can re-sort the order of the groups and, for some objects (the texts in our example, for instance), open them in the viewer. Looking at some of the texts in a cluster might give you further insight into what the cluster is about.
The difference in word distribution between the clusters can be considered a visual internal evaluation. The more obvious the difference in distribution the better the clustering. The relative clustering of words can help in observing the differences.
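What you compare by eye can also be written down as numbers. The Python sketch below computes the average word weights for each cluster, so that the profiles can be compared. It is only an illustration. The function `cluster_profiles` and the example values are made up, and Infomat does not necessarily compute anything like this.

```python
def cluster_profiles(matrix, labels):
    """Average word vector for each cluster (hypothetical helper).

    matrix holds one row per text, labels holds the cluster of each text.
    """
    sums, sizes = {}, {}
    for row, label in zip(matrix, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            sizes[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        sizes[label] += 1
    return {label: [s / sizes[label] for s in sums[label]] for label in sums}


# Three texts over two words, with invented weights and cluster labels.
matrix = [
    [2, 0],
    [4, 0],
    [0, 3],
]
labels = [0, 0, 1]
profiles = cluster_profiles(matrix, labels)
```

Here the two clusters use completely different words, which corresponds to a very clear difference in the picture.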
The original text files are stored in directories based on their manual category. You can construct a grouping that follows this categorization using the Location Grouper in the Clustering Algorithms window. (Remember to choose Rows.)
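The idea of grouping by location can be sketched in a few lines of Python. This is not how the Location Grouper is implemented. The function name and the file paths are invented for the example.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_directory(paths):
    """Group file paths by the name of their parent directory."""
    groups = defaultdict(list)
    for p in paths:
        groups[PurePosixPath(p).parent.name].append(p)
    return dict(groups)


# Hypothetical file layout: one directory per manual category.
paths = [
    "texts/sports/a.txt",
    "texts/politics/b.txt",
    "texts/sports/c.txt",
]
groups = group_by_directory(paths)
```

Each directory name becomes one group, so the grouping follows the manual categorization of the files.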
Choose the Color grouping along the rows to be the newly created location grouping. Now the Shown and Color groupings are the same. You can click the E button for the Color grouping to see what the colors represent.
Change the Shown grouping back to be your K-Means clustering. Now the coloring lets you see the distribution of the categories over the clusters - a visual external evaluation.
You can also perform an ordinary evaluation using the Evaluation option in the Tools menu. Choose a grouping to evaluate and a reference grouping and press Evaluate. (If you choose no reference grouping you get only internal measures.) Save your result to a file that you name appropriately. Open the file with a browser that supports XML and XSL.
Using the Export to text option in the Tools menu, a grouping can be saved to an XML file that can be viewed in a browser. Try this. You can click on any link in the browser to open clusters and texts, etc.
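As an example of an external measure, the Python sketch below counts how the categories of a reference grouping are distributed over the clusters and computes purity from those counts. Purity is used here as an assumption. The text above does not list which measures the Evaluation tool reports. The labels are invented.

```python
from collections import Counter


def contingency(clusters, categories):
    """Count, for each cluster, how many texts come from each category."""
    table = {}
    for cluster, category in zip(clusters, categories):
        table.setdefault(cluster, Counter())[category] += 1
    return table


def purity(clusters, categories):
    """Fraction of texts that belong to the majority category of their cluster."""
    table = contingency(clusters, categories)
    majority_total = sum(max(counts.values()) for counts in table.values())
    return majority_total / len(clusters)


# Six texts: the cluster of each text and its category in the reference grouping.
clusters = [0, 0, 0, 1, 1, 1]
categories = ["sports", "sports", "politics", "politics", "politics", "sports"]

table = contingency(clusters, categories)
score = purity(clusters, categories)
```

The counts in `table` are the numerical version of the coloring: a cluster with one dominating color has one dominating category.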
Now try to get a better result than the first by any means:
Are any of the newsgroups easier for the algorithm to find? Why do you think so?
Remember to store your work (matrix and groupings) once in a while!
This laboration is just an introduction. You are not supposed to write a long report, but we would like you to answer some questions. Your answers will be treated like a user questionnaire. They will help Magnus get ideas on how to improve the program. The analysis of your answers (and perhaps some of your comments) will be reported in a report/paper on the program. For your own sake: do not write long essays...
Download the questionnaire file. Answer the "dotted" questions in the file while you do the laboration. Send it as an attachment to rosell at csc kth se
Some more questions: