School of Computer Science and Communication

Clustering Course

Laboration 2 - Experiments

Preparation

  1. Download this new version of the text set: example20ngB.zip. It contains setup files for some experiments.
  2. Download the new version of the Infomat program here. Use the latest version! (Check the course homepage also!) Read the following:
    • the readme.txt file. (You may need to change the paths as described.)
    • the new infomatmanYYMMDD.pdf file, especially Chapters 1 and 3. Keep the manual close for the rest of the laboration.

Infomat as a Processing Tool

Chapter 3 of the manual describes Infomat as a processing tool. You will use the command prompt functionality in this laboration. Run all the examples in the "Command Prompt Usage" section.

Impact of preprocessing

Look at the example (example20ngB). It contains the same example corpus as in laboration 1, but with the really bad texts removed. It also includes an experiment setup for comparing different kinds of preprocessing. You will find it in the directory preprocessingExperiments. Look in the properties subdirectory.

Change the "Result Path", "Stoplist" and "Token File" in the Experimentator_Properties.xml file so that the paths point to where they are on your computer. (If you want to generate browsable result files or use the GUI, you also need to change the path in the token file as in laboration 1.)

Run the experiment as described in the manual. It can take quite some time, depending on your computer. Build a result table as described in the manual. Use the structure file in the tables subdirectory. You have to change the commonpath so that it points to the right location on your computer.

You will probably find these results somewhat disappointing: they do not improve with preprocessing! Why do you think that is? Discuss this in the report. Consider, on the other hand, the huge difference in the size of the representation: each clustering of the reduced representation takes less time.

It is very hard to know how to preprocess the texts to get a good representation. There are very many parameters, and many of them influence each other. Try some variations, for instance:

  • Try to change the properties for the preprocessing and/or stoplist.
  • Build new setups with other preprocessing. Can you get better results with a small representation? (One way could be to use the GUI to do the preprocessing and save the resulting matrix. Then you have to build a new setup where the Experimentator file points to that matrix file.)
  • Change the number of clusters for K-Means to 10 instead of 5. Remember to remove the results from the result directory, or create an entirely new setup for this.
  • Change the number of repetitions in the properties file for the Experimentator. Can you tell the results better apart then?

Include your own preprocessing in the result table in the report. Do not try too many. You do not have to find any tendency.

Set up your own experiments

Now set up at least one experiment of your own. Do not consider preprocessing. Start from a matrix file you have created, or use the tokenFile with both filtering and stoplist so that the representation is small.

You can investigate whatever you like. There are several different methods to look at, and all of them have several properties you might alter (look at the Properties files). Here are some suggestions:

  • Compare the different algorithms: K-Means, Bisecting K-Means and Random Clustering. A random clustering is always a good reference.
  • How does the result change with number of clusters?
  • What is the effect of different parameters for K-Means? (Initial grouping, condition for repetition/number of repetitions)
  • Try different similarity measures.
  • Try different weightings.
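To illustrate why a random clustering makes a useful reference, here is a minimal sketch in plain Java that assigns every text to a uniformly chosen cluster. It does not use Infomat: the class name RandomBaseline, the method randomClustering and the seed parameter are made up for this illustration, and Infomat's own Random Clustering may work differently.

```java
import java.util.Random;

/** Illustrative random baseline; hypothetical helper, not part of Infomat. */
final class RandomBaseline {

    /**
     * Returns an array where entry i is the cluster (0..k-1) of text i.
     * A fixed seed makes the "random" reference reproducible between runs.
     */
    static int[] randomClustering(int numTexts, int k, long seed) {
        if (numTexts < 0 || k < 1) {
            throw new IllegalArgumentException("need numTexts >= 0 and k >= 1");
        }
        Random rng = new Random(seed);
        int[] assignment = new int[numTexts];
        for (int i = 0; i < numTexts; i++) {
            assignment[i] = rng.nextInt(k); // uniform over the k clusters
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 12 texts, 5 clusters; the seed value is arbitrary.
        int[] clusters = randomClustering(12, 5, 2008L);
        for (int i = 0; i < clusters.length; i++) {
            System.out.println("text " + i + " -> cluster " + clusters[i]);
        }
    }
}
```

Any method worth using should score clearly better than such an assignment on the same data and with the same number of clusters.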

It may be a good idea to try your setup on a smaller text set if it takes a long time to run, since the Experimentator aborts if any part of the experiment setup is incorrect. You could use the examples that are found in the Infomat directory.

When you do your actual experiment, you probably need to set the "Number of Repetitions" for the Experimentator to at least 10 to be able to draw any conclusions from the results. (When you test your setup, a lower number may be enough.)
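The reason for repeating each experiment is that a single run says little about the spread of the results. As a minimal sketch of what the repetitions give you, the plain Java below computes the mean and the sample standard deviation of a set of quality values. The class RepetitionStats and the numbers in main are made up for illustration; the result tables built with Infomat may present these figures differently.

```java
import java.util.Arrays;

/** Summary statistics over repeated runs; hypothetical helper, not part of Infomat. */
final class RepetitionStats {

    /** Arithmetic mean of the quality values (NaN for an empty array). */
    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    /** Sample standard deviation (divides by n - 1); needs at least two values. */
    static double sampleStdDev(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("need at least two repetitions");
        }
        double m = mean(values);
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumOfSquares / (values.length - 1));
    }

    public static void main(String[] args) {
        // Made-up quality values from 10 repetitions of one imaginary setup.
        double[] quality = {0.41, 0.39, 0.44, 0.40, 0.42, 0.38, 0.43, 0.41, 0.40, 0.42};
        System.out.printf("mean = %.3f, std = %.3f%n", mean(quality), sampleStdDev(quality));
    }
}
```

As a rough rule of thumb, if the means of two setups differ by less than their standard deviations, the results are hard to tell apart.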

If you want to do more

If you have fair Java skills, you could do any of the following as your own experiment or in addition to it:
  • Write your own algorithm, weighting scheme, preprocessor, or whatever you like and compare it to any of the available ones.

  • Investigate the quality increase between iterations in the K-Means algorithm. Alter the doOneExperiment method in the Experimentator, or start from the ExampleClusterer. "Steal" code from the Clusterer as well.

    In the K-Means class I have temporarily put a method oneIter that takes an IObjectGrouping and a SparseISimilarity and returns a new grouping representing the state after one iteration. Start from a random clustering, using the existing algorithm.

    Save the result for each iteration in a separate directory. Repeat the experiment, storing each result in the directory for the corresponding iteration. Build a result table or matrix using the ExperimentResultGenerator.
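To make the idea of measuring quality between iterations concrete, here is a small stand-alone sketch of one K-Means iteration. It is not the oneIter method in Infomat and does not use IObjectGrouping or SparseISimilarity: it assumes dense double vectors, squared Euclidean distance, and the within-cluster sum of squares as the quality measure (lower is better). All class and method names are invented for this illustration.

```java
import java.util.Arrays;

/** Illustrative K-Means step on dense vectors; not Infomat's oneIter. */
final class KMeansIterationSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return sum;
    }

    /** Mean vector of each cluster; an empty cluster keeps an all-zero row. */
    static double[][] centroids(double[][] points, int[] grouping, int k) {
        int dim = points[0].length;
        double[][] centers = new double[k][dim];
        int[] sizes = new int[k];
        for (int i = 0; i < points.length; i++) {
            sizes[grouping[i]]++;
            for (int d = 0; d < dim; d++) {
                centers[grouping[i]][d] += points[i][d];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dim && sizes[c] > 0; d++) {
                centers[c][d] /= sizes[c];
            }
        }
        return centers;
    }

    /** One iteration: recompute centroids, then move each point to its nearest one. */
    static int[] oneIteration(double[][] points, int[] grouping, int k) {
        double[][] centers = centroids(points, grouping, k);
        boolean[] nonEmpty = new boolean[k];
        for (int g : grouping) {
            nonEmpty[g] = true;
        }
        int[] next = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                double dist = squaredDistance(points[i], centers[c]);
                if (nonEmpty[c] && dist < bestDist) { // empty clusters attract nothing
                    bestDist = dist;
                    next[i] = c;
                }
            }
        }
        return next;
    }

    /** Quality of a grouping: sum of squared distances to the own centroid. */
    static double withinClusterSumOfSquares(double[][] points, int[] grouping, int k) {
        double[][] centers = centroids(points, grouping, k);
        double total = 0.0;
        for (int i = 0; i < points.length; i++) {
            total += squaredDistance(points[i], centers[grouping[i]]);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}}; // toy data
        int[] grouping = {0, 1, 0, 1};                            // poor starting point
        int k = 2;
        for (int iter = 0; iter < 10; iter++) {
            System.out.printf("iteration %d: quality %.2f%n",
                    iter, withinClusterSumOfSquares(points, grouping, k));
            int[] next = oneIteration(points, grouping, k);
            if (Arrays.equals(next, grouping)) {
                break; // converged: no point changed cluster
            }
            grouping = next;
        }
    }
}
```

In the real experiment the printed value would instead be written to the directory for that iteration, so that the results from several repetitions can be averaged per iteration.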

Report

Write a report on your work. It should not be very long, but it should include tables and/or plots of your results.

(If you investigate a property like the number of clusters, it is nice to give the results in a graph - quality as a function of the number of clusters.)

Discuss the results of (both) the experiments:

  • Are any of the differences in results significant (consider the standard deviation)?
  • Why do you believe that certain parameters improve/worsen results? (Is there any problem with the data set?)
  • In particular, why do you think the preprocessing in the first experiment did not improve results?
  • Discuss the appropriateness of using the different measures to evaluate the result of the parameters you investigate. In particular, when are internal and external evaluation appropriate?
All these questions are hard. Do not be discouraged if you cannot answer them. Give it a go!
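As a reminder of what separates the two kinds of evaluation: an external measure compares the clustering with known categories, while an internal measure uses only the representation itself. The sketch below computes purity, one common external measure, in plain Java. It is only an illustration: the class Purity is hypothetical, the labels are made up, and the measures reported by Infomat may be defined differently.

```java
import java.util.HashMap;
import java.util.Map;

/** Purity of a clustering against known categories; hypothetical helper, not part of Infomat. */
final class Purity {

    /**
     * clusters[i] is the cluster of text i, labels[i] its known category.
     * Purity is the fraction of texts that belong to the majority category
     * of their cluster; 1.0 means every cluster holds a single category.
     */
    static double purity(int[] clusters, String[] labels) {
        if (clusters.length != labels.length || clusters.length == 0) {
            throw new IllegalArgumentException("need equally long, non-empty arrays");
        }
        // counts.get(c).get(label) = number of texts in cluster c with that label
        Map<Integer, Map<String, Integer>> counts = new HashMap<>();
        for (int i = 0; i < clusters.length; i++) {
            counts.computeIfAbsent(clusters[i], c -> new HashMap<>())
                  .merge(labels[i], 1, Integer::sum);
        }
        int majorityTotal = 0;
        for (Map<String, Integer> perLabel : counts.values()) {
            int best = 0;
            for (int n : perLabel.values()) {
                best = Math.max(best, n);
            }
            majorityTotal += best;
        }
        return (double) majorityTotal / clusters.length;
    }

    public static void main(String[] args) {
        int[] clusters = {0, 0, 0, 1, 1, 1};
        String[] labels = {"sport", "sport", "politics", "politics", "politics", "sport"};
        System.out.println("purity = " + purity(clusters, labels));
    }
}
```

Note that a measure like this needs the categories, so it is only available for test collections; it also tends to increase with the number of clusters, which is worth keeping in mind when comparing setups.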

Send your report to rosell at csc kth se.



Page maintainer: Magnus Rosell <rosell@csc.kth.se>
Updated 2008-10-02