School of Computer Science and Communication

Clustering Course

Laboration 1 - The Infomat GUI and Basic Clustering

Preparation

  1. Install Java SE 6. The program might work with Java SE 5, but there is no guarantee.
  2. To view some files you need a browser that supports XML with XSL stylesheets. Most modern browsers do.
  3. Download the example corpus: example20ng.zip. Put the zip file somewhere and unzip it. Familiarize yourself with the texts in the example. They are stored in the example20ng/texts/ directory in five subdirectories according to category. The example is a part of the 20 newsgroups corpus, which is freely available.
  4. Alter the documentpath item in the file tokenFile.xml so that it points to the directory example20ng/texts/ on your computer. The token file contains the preprocessed corpus: we have removed the headers of the texts, then tokenized and stemmed them. (The file is rather big, so use an editor that can handle large files, such as Emacs. On Windows: rename the file to tokenFile.txt, open it with WordPad, alter it, save it, and rename it back.)
  5. Download the latest version of the Infomat program here. Unzip it. Read the following:
    • the readme.txt file. (You may need to change the paths as described.)
    • the infomatmanYYMMDD.pdf file: Chapter 1 and Sections 2.1-2.2 and 2.9. Keep the manual close at hand for the rest of the laboration.
  6. Skim these instructions before you start. Read the last section "Report". Download the questionnaire file. Answer the questions in the file while you work on the laboration.

Start and Preprocessing

Start the class InfomatGUI using your programming environment or from a command prompt (see also readme.txt):

.../Infomat>java -cp classes/ -Xms1024m -Xmx1024m infomat.InfomatGUI

Open the File menu and choose "Open tokenFile". Choose the tokenFile.xml in the example20ng directory. After loading, the main view shows a picture corresponding to a matrix with roughly 2500 texts (rows) and over 50000 words (columns). Be patient - the matrix is large.
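To make the idea of a text-by-word matrix concrete, here is a minimal sketch in Python. It is not Infomat's code: the function name build_matrix and the tiny token lists are made up for illustration, and the alphabetical column order is just a choice made for this sketch.

```python
from collections import Counter


def build_matrix(texts):
    """Return (vocabulary, rows), where rows[i][j] counts word j in text i.

    `texts` is a list of token lists, one per text. The column order
    (alphabetical) is an arbitrary choice made for this sketch.
    """
    vocabulary = sorted({word for tokens in texts for word in tokens})
    rows = []
    for tokens in texts:
        counts = Counter(tokens)  # missing words count as 0
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Three hypothetical, already tokenized texts.
texts = [
    ["space", "orbit", "launch", "space"],
    ["hockey", "game", "team"],
    ["orbit", "team"],
]
vocabulary, rows = build_matrix(texts)
for row in rows:
    print(row)
```

Each printed row corresponds to one text and each position in it to one word. Most entries are zero, which is why the picture in the main view is mostly empty.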

There is a pattern in the picture. Why? What is the order of the texts (rows) and the words (columns)?

The preprocessing is not very good - there are a lot of bad "words". There are several horizontal lines. What are they? Use the Pixel View to find out. This is the main tool for getting textual information out of Infomat. Read about it in the manual. (It opens via the Views menu. Don't forget to choose Pixel Selection in the Image menu or press the corresponding toolbar button.) Read some of the strange texts by clicking on their buttons in the lists of the Pixel View window.

Remove Columns

Remove the corresponding words, using the Remove columns option in the Image menu (or the button in the toolbar). Then Purge the matrix using the option in the Tools menu. The columns you removed in the picture are now removed from the matrix as well.

Save the Matrix

Save the matrix using the Save matrix option in the File menu. Call the file(s) matrix01.xml or something more imaginative, so that you will remember what they are. If you need to backtrack your work you can reload a previous version through the File menu. Try this now.

Stoplist

The Stoplist option in the Tools menu allows you to remove stop words. It is divided into three sections. The leftmost contains some properties. Look at them, then click the "Whole stoplist to IO" button in the rightmost section. The list now contains only those words along the columns that agree with the properties. Sift through the list. If there are any words you do not want to remove, select them and choose "Remove" in the "Select" drop-down menu, then press "Apply" to its right. You can read more about the Stoplist in the manual.

When you are done press the Apply button. The stoplist tool automatically purges the matrix.

There is a stoplist, eng_stoplist.txt, in the files directory in the Infomat directory. Load it in the middle section, then press either the "Whole Stoplist to IO" or the "From Strings to IO" button to get a new list in the right section containing the words that are both in the file and in the columns of the matrix. (Both buttons do the same thing now, since you have removed the words that agree with the properties.) Remove them.
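The stoplist step amounts to intersecting a word list with the matrix columns and then dropping those columns. The following Python sketch illustrates the idea only. The function names and the small example are hypothetical and say nothing about how Infomat implements it.

```python
def stoplist_hits(stoplist, vocabulary):
    """Words that are both in the stoplist and among the matrix columns."""
    stop = set(stoplist)
    return [word for word in vocabulary if word in stop]


def remove_columns(vocabulary, rows, to_remove):
    """Return a new (vocabulary, rows) without the given words."""
    remove = set(to_remove)
    keep = [j for j, word in enumerate(vocabulary) if word not in remove]
    new_vocabulary = [vocabulary[j] for j in keep]
    new_rows = [[row[j] for j in keep] for row in rows]
    return new_vocabulary, new_rows


# Hypothetical matrix: two texts, four words.
vocabulary = ["the", "space", "and", "orbit"]
rows = [[3, 1, 2, 0],
        [1, 0, 1, 2]]
stoplist = ["the", "and", "of"]  # "of" is not a column, so it is ignored

hits = stoplist_hits(stoplist, vocabulary)
vocabulary, rows = remove_columns(vocabulary, rows, hits)
```

Building the list of hits first, before removing anything, mirrors the workflow above: you get a chance to sift through the candidates and keep words you consider useful.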

You can also remove objects using the Pixel View window. Try it! Don't forget to purge the matrix when you are done.

Filter Matrix

There is one more way to remove uninteresting stuff - the Filter Matrix option in the Algorithms menu. Look at the Properties and alter them. Use Filter Matrix last during removal. It allows you to remove texts and words that no longer have any matrix elements corresponding to them. The filtering is done on the current matrix, so if you apply it again it may remove even more.
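Why can a second application remove more? The sketch below shows one way this can happen. It is written in Python under stated assumptions: the two thresholds (min_texts_per_word and min_words_per_text) are hypothetical stand-ins for whatever the Filter Matrix Properties actually offer.

```python
def filter_matrix(vocabulary, rows, min_texts_per_word=2, min_words_per_text=2):
    """One filtering pass over the current matrix.

    First drop words that occur in too few texts, then drop texts that
    are left with too few distinct words. Both thresholds are
    hypothetical parameters chosen for this sketch.
    """
    n_cols = len(vocabulary)
    doc_freq = [sum(1 for row in rows if row[j] > 0) for j in range(n_cols)]
    keep_cols = [j for j in range(n_cols) if doc_freq[j] >= min_texts_per_word]
    reduced = [[row[j] for j in keep_cols] for row in rows]
    kept_rows = [row for row in reduced
                 if sum(1 for value in row if value > 0) >= min_words_per_text]
    return [vocabulary[j] for j in keep_cols], kept_rows


vocabulary = ["a", "b", "c", "d"]
rows = [[1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 1]]

first_vocabulary, first_rows = filter_matrix(vocabulary, rows)
second_vocabulary, second_rows = filter_matrix(first_vocabulary, first_rows)
```

In the first pass the word "d" goes (it occurs in one text only), and with it the third text. That leaves "c" in a single text, so the second pass removes "c" as well.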

When you are down to around 10 000 good words you can go on.

Weighting

When you have removed what is necessary it is time to weight the matrix, using the Weight Matrix option in the Algorithms menu. Look at the Properties before you press the Apply button. You can use the default settings.
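As an illustration of what weighting does, here is a sketch of tf-idf weighting with length normalization, a common scheme for text matrices. That Infomat's default settings correspond to this scheme is an assumption, so check the Properties and the manual for what is actually applied.

```python
import math


def tfidf_weight(rows):
    """Weight raw counts by inverse document frequency, then normalize
    each text (row) to unit Euclidean length."""
    n_texts = len(rows)
    n_cols = len(rows[0]) if rows else 0
    doc_freq = [sum(1 for row in rows if row[j] > 0) for j in range(n_cols)]
    # A word that occurs in every text gets weight log(1) = 0.
    idf = [math.log(n_texts / df) if df else 0.0 for df in doc_freq]
    weighted = []
    for row in rows:
        vector = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(value * value for value in vector))
        weighted.append([value / norm for value in vector] if norm else vector)
    return weighted


# Two hypothetical texts over three words; the middle word occurs in both.
rows = [[2, 1, 0],
        [0, 1, 3]]
weighted = tfidf_weight(rows)
```

The word shared by both texts ends up with weight zero, since it does not help to tell the texts apart. This is also why weighting comes after removal: the document frequencies depend on which texts and words are still in the matrix.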

Remember the following order: remove, purge, weight. You have to apply these actions, in this order, for them to affect the following algorithms.

  • Preprocessing. What do you think of these preprocessing tools? Are they useful? Is the visualization of any help? Are any of the tools better or worse than the others? Why?

Save a few matrices with different amounts of preprocessing: with/without stoplist, filtering, removal of different amounts of bad "words", etc. You can use these in laboration 2. Or you can go back and do this later (for instance when you do laboration 2).

Groupings

Now you are ready to cluster! Open the Clustering Algorithms window from the Algorithms menu. It is set to rows and the K-Means algorithm is chosen. Look at its Properties. Then press the Apply button.

The texts (rows) are grouped into five clusters along the vertical axis. The columns have the same order as before. Press the two buttons to the left in the toolbar. The clusters are now separated by horizontal lines. You can see a difference in the distributions of words between the text clusters. Using the Pixel View you can try to decide what the clusters are about.

What is the order of the texts within the clusters? (Look at the K-Means Properties.)

To understand the cluster content better you can use the Relative Clusterer in the Clustering Algorithm window. Choose Columns! Explore the words in the corresponding word cluster for each text cluster. Do you get a better idea of what the clusters are about?

Don't forget that K-Means can give different results each time.
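The reason for the varying results is the random starting point. The following is a minimal K-Means sketch in Python, not Infomat's implementation. It assumes Euclidean distance and random initial centroids drawn from the data, and the names k_means and seed are chosen for this sketch. See the K-Means Properties for what Infomat actually uses.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, seed=None, max_iterations=100):
    """Return one cluster label (0..k-1) per vector."""
    rng = random.Random(seed)
    # Random start: k distinct vectors serve as the initial centroids.
    centroids = [list(vector) for vector in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each vector joins its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(vector, centroids[c]))
            for vector in vectors
        ]
        if new_labels == labels:
            break  # nothing changed, so the clustering has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = [sum(column) / len(members)
                                for column in zip(*members)]
    return labels


# Six hypothetical two-dimensional "texts" in two obvious groups.
vectors = [[0, 0], [0, 1], [1, 0], [100, 100], [100, 101], [101, 100]]
labels = k_means(vectors, k=2, seed=0)
other_labels = k_means(vectors, k=2, seed=1)
```

With data this clean, different seeds should at most swap the cluster numbers. On real text data the starting centroids matter much more, so it is worth running the algorithm several times and comparing the groupings.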

Grouping Edit Window

You can also look at the clusterings using the Grouping Edit and Group Edit windows. You open the former by pressing the E button in the Grouping panel. Here you can rename the clusters, reorder them and save the clustering to a file. Saved groupings can also be loaded again. Try it! Matrices are not saved with their groupings. You have to save all groupings separately.

Group Edit Window

In the Grouping Edit window there is an E button for each group. Press one and the corresponding Group Edit window opens. Here you can re-sort the order of the groups and, for some objects (the texts in our example, for instance), open them in the viewer. Looking at some of the texts in a cluster might give you further insight into what the cluster is about.

  • Groupings and Groups. Do you think the distributional patterns in the picture are useful when trying to understand the clustering? What about the relative clustering? Do the patterns help you to grasp the cluster contents when combined with the textual presentation? How important do you think the actual texts are to understand the cluster content? Any comments on the Grouping Edit and Group Edit windows?

Evaluation

Visual Evaluation

The difference in word distribution between the clusters can be considered a visual internal evaluation. The more obvious the difference in distribution the better the clustering. The relative clustering of words can help in observing the differences.

The original text files are stored in directories based on their manual category. You can construct a grouping that follows this categorization using the Location Grouper in the Clustering Algorithms window. (Remember to choose Rows.)
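The idea behind a location-based grouping can be sketched in a few lines of Python: every text is put in the group named after the directory it is stored in. This is only an illustration. The file paths below are hypothetical, and the sketch makes no claim about how the Location Grouper is implemented.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_directory(paths):
    """Map each parent directory name to the list of files stored in it."""
    groups = defaultdict(list)
    for path in paths:
        groups[PurePosixPath(path).parent.name].append(path)
    return dict(groups)


# Hypothetical paths; the real category directories may be named differently.
paths = [
    "example20ng/texts/sci.space/1001",
    "example20ng/texts/sci.space/1002",
    "example20ng/texts/rec.sport.hockey/2001",
]
groups = group_by_directory(paths)
```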

Choose the Color grouping along the rows to be the newly created location grouping. Now the Shown and Color groupings are the same. You can click the E button for the color grouping to see what the colors represent.

Change the Shown grouping back to be your K-Means clustering. Now the coloring lets you see the distribution of the categories over the clusters - a visual external evaluation.
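What the coloring shows can also be written down as a table of counts: for each cluster, how many of its texts come from each category. A minimal Python sketch, with made-up labels for five texts:

```python
from collections import Counter


def contingency(cluster_labels, category_labels):
    """For each cluster, count how many of its texts belong to each category."""
    table = {}
    for cluster, category in zip(cluster_labels, category_labels):
        table.setdefault(cluster, Counter())[category] += 1
    return table


# Hypothetical result: five texts, two clusters, two manual categories.
cluster_labels = [0, 0, 1, 1, 1]
category_labels = ["sci", "sci", "rec", "sci", "rec"]

table = contingency(cluster_labels, category_labels)
for cluster, counts in sorted(table.items()):
    print(cluster, dict(counts))
```

A cluster whose counts are dominated by one category corresponds to a band in the picture that is mostly one color.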

Ordinary Evaluation

You can also perform ordinary evaluation using the Evaluation option in the Tools menu. Choose a grouping to evaluate and a reference grouping and press Evaluate. (If you choose no reference grouping you get only internal measures.) Save your result to a file that you name appropriately. Open the file with your browser (which must support XML and XSL).
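As an example of an external measure, the sketch below computes purity: the share of texts that belong to the majority category of their cluster. Purity is a common measure, but whether it is among the measures Infomat reports is an assumption here. The labels are made up.

```python
from collections import Counter


def purity(cluster_labels, reference_labels):
    """Fraction of texts that fall in the majority category of their cluster."""
    per_cluster = {}
    for cluster, reference in zip(cluster_labels, reference_labels):
        per_cluster.setdefault(cluster, Counter())[reference] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_labels)


# Hypothetical result: five texts, two clusters, two manual categories.
cluster_labels = [0, 0, 1, 1, 1]
reference_labels = ["sci", "sci", "rec", "sci", "rec"]
score = purity(cluster_labels, reference_labels)
```

Note that purity alone can be misleading: putting every text in its own cluster gives a perfect score. This is one reason to compare several measures and clusterings with the same number of clusters.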

  • Evaluation. Is visual intrinsic evaluation helpful? Is visual extrinsic evaluation helpful? In what ways? How do they compare to ordinary evaluation?

Textual results

Using the Export to text option in the Tools menu, a grouping can be saved to an XML file that can be viewed with a browser. Try this. You can click on any link in the browser to open clusters and texts, etc.

  • Results: Textual vs. Visual. What are the strengths and weaknesses of textual and visual result presentation? Which one do you prefer?

Improve Results

Now try to get a better result than the first by any means:

  • cluster to different numbers of clusters
  • remove more words and texts, or whole groups of words or texts
  • change weighting
  • etc.
Compete with the others - who can get the best result? Compare the different measures.

Are any of the newsgroups easier for the algorithm to find? Why do you think that is?

Remember to store your work (matrix and groupings) once in a while!

Report

This laboration is just an introduction. You are not supposed to write a long report, but we would like you to answer some questions. Your answers will be treated like a user questionnaire. They will help Magnus to get ideas on how to improve the program. The analysis of your answers (and perhaps some of your comments) will be presented in a report or paper on the program. For your own sake: do not write long essays...

Download the questionnaire file. Answer the "dotted" questions in the file while you do the laboration. Send it as an attachment to rosell at csc kth se

Some more questions:

  • Text Set.
    • What do you think of the text set?
    • How good a result could you achieve? What did you do to get it?

  • The link between text and visualization.
    • What do you think about the link between the actual objects (the texts and words) and the visualization? Is it hard or easy to grasp?

  • General.
    • Was there anything you didn't understand? What didn't you understand immediately?
    • Is there anything that is missing? Any function(s) that you would like to have?
    • Do you think a program like this is useful? Would you like to use it again? For what purpose?
    • Any other comments.


Page maintainer: Magnus Rosell <rosell@csc.kth.se>
Updated 2009-10-15