Laboration 2 - Experiments
Download this new version of the text set:
example20ngB.zip. It contains set up files for some experiments.
Download the new version of the Infomat program.
Use the latest version! (Check the course homepage also!)
Read the following:
(You may need to change the paths as described.)
- the new manual,
especially Chapters 1 and 3.
Keep the manual close for the rest of the laboration.
Infomat as a Processing Tool
Chapter 3 of the manual describes Infomat as a processing tool.
You will use the command prompt functionality in this laboration.
Run all the examples in the "Command Prompt Usage" section.
Impact of preprocessing
Look at the example (example20ngB). It contains
the same example corpus as in laboration 1,
but with the really bad texts removed.
It also includes an experiment set up for comparing different preprocessing.
You find it in the directory
Look in the
so that the paths point to where they are on your computer.
(If you want to generate browsable result files or use the GUI
you also need to change the path in the token file
as in laboration 1.)
Run the experiment as described in the manual. It can take quite some time,
depending on your computer.
Build a result table as described in the manual. Use the structure file
You have to change the
commonpath so that it points
to the right location on your computer.
You will probably find these results somewhat disappointing.
The results do not improve with preprocessing!
Why do you think that is? Discuss this in the report.
Consider, on the other hand, the huge difference in the size of
the representation. The time for each clustering of the
reduced representation is shorter.
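To get a feeling for how much preprocessing shrinks the representation, here is a minimal sketch on a toy corpus. It is not Infomat code: the stoplist, the document-frequency threshold and the function names are assumptions made up for this illustration.

```python
from collections import Counter

# Toy corpus; a real experiment would read the texts from the text set.
texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog met on the mat",
]

# Hypothetical stoplist and threshold, chosen only for illustration.
STOPLIST = {"the", "a", "on", "and"}
MIN_DOC_FREQ = 2  # keep terms that occur in at least this many texts


def vocabulary(docs):
    """Return the set of distinct terms in the tokenized documents."""
    return {term for doc in docs for term in doc}


def preprocess(docs, stoplist, min_doc_freq):
    """Remove stoplist terms and terms that occur in too few documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        [t for t in doc if t not in stoplist and doc_freq[t] >= min_doc_freq]
        for doc in docs
    ]


tokenized = [text.split() for text in texts]
reduced = preprocess(tokenized, STOPLIST, MIN_DOC_FREQ)

full_vocab = vocabulary(tokenized)
reduced_vocab = vocabulary(reduced)
print(len(full_vocab), "terms before,", len(reduced_vocab), "terms after")
```

Each remaining term is one dimension of the representation, so fewer terms means shorter vectors and less work per similarity computation.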
It is very hard to know how to preprocess the texts to get
a good representation. There are very many parameters,
and many of them influence each other.
Try some variations, for instance:
- Try to change the properties for the preprocessing and/or stoplist.
Build new setups with
other preprocessing. Can you get better results
with a small representation?
(One way could be to use the GUI to do the preprocessing
and save the resulting matrix. Then you have to build
a new setup where the Experimentator file points to
that matrix file.)
- Change the number of clusters for K-Means to 10
instead of 5. Remember to remove the results from the result directory,
or create an entirely new setup for this.
- Change the number of repetitions in the properties file for
the Experimentator. Can you tell the results apart better then?
Include your own preprocessing in the result table
in the report. Do not try too many. You do not have
to find any tendency.
Set up your own experiments
Now, set up at least one experiment of your own.
Do not consider preprocessing.
Start from a matrix file you have created or use the tokenFile
with both filtering and stoplist, so the representation
You can investigate whatever you like.
There are several different methods to look at,
and all of them have several properties you might alter
(look at the Properties files).
Here are some suggestions:
- Compare the different algorithms: K-Means,
Bisecting K-Means and Random Clustering.
A random clustering is always a good reference.
- How does the result change with number of clusters?
- What is the effect of different parameters for K-Means?
(Initial grouping, condition for repetition/number of repetitions)
- Try different similarity measures.
- Try different weightings.
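The point about a random clustering being a good reference can be sketched as follows. This is an illustration only and does not use Infomat: purity is used here as one example of an external measure (it is an assumption that it matches any measure in the result tables), and the labels and function names are hypothetical.

```python
import random
from collections import Counter


def purity(clusters, labels):
    """Fraction of objects that belong to the majority class of their cluster."""
    total = sum(len(cluster) for cluster in clusters)
    majority = sum(
        Counter(labels[i] for i in cluster).most_common(1)[0][1]
        for cluster in clusters
        if cluster  # skip empty clusters
    )
    return majority / total


def random_clustering(n_objects, k, rng):
    """Assign each object to one of k clusters uniformly at random."""
    clusters = [[] for _ in range(k)]
    for i in range(n_objects):
        clusters[rng.randrange(k)].append(i)
    return clusters


# Toy data: 12 objects in 3 known classes (hypothetical labels).
labels = ["a"] * 4 + ["b"] * 4 + ["c"] * 4
perfect = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

rng = random.Random(0)  # fixed seed so the run is repeatable
baseline = [purity(random_clustering(len(labels), 3, rng), labels) for _ in range(100)]
mean_baseline = sum(baseline) / len(baseline)

print("perfect clustering:", purity(perfect, labels))
print("random baseline (mean of 100):", round(mean_baseline, 3))
```

A random clustering does not score zero, so the score of a real algorithm only means something relative to this baseline.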
It may be a good idea to try your setup
on a smaller text set if it takes a long time to run,
since the Experimentator aborts if any part of the experiment set up
is incorrect. You could use the examples that are found in the Infomat
When you do your actual experiment
you probably need to set the "Number of Repetitions"
for the Experimentator to at least 10
to be able to draw any conclusions from the results.
(When you test your setup, a smaller number is enough.)
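Why the number of repetitions matters can be seen in a small sketch that summarizes repeated runs by mean and standard deviation. The scores below are invented for illustration, and the overlap check is only a rough rule of thumb under that assumption, not a formal significance test.

```python
import statistics

# Hypothetical quality scores from 10 repetitions of two setups.
setup_a = [0.61, 0.58, 0.64, 0.60, 0.59, 0.63, 0.62, 0.57, 0.60, 0.61]
setup_b = [0.63, 0.60, 0.66, 0.59, 0.64, 0.65, 0.61, 0.62, 0.63, 0.64]


def summarize(scores):
    """Return the mean and the sample standard deviation of the scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def intervals_overlap(scores_x, scores_y):
    """Rough check: do the mean +/- one standard deviation intervals overlap?"""
    mean_x, std_x = summarize(scores_x)
    mean_y, std_y = summarize(scores_y)
    return abs(mean_x - mean_y) <= std_x + std_y


for name, scores in (("A", setup_a), ("B", setup_b)):
    mean, std = summarize(scores)
    print(f"setup {name}: mean={mean:.3f} std={std:.3f} (n={len(scores)})")

print("intervals overlap:", intervals_overlap(setup_a, setup_b))
```

If the intervals overlap, as they do for these made-up numbers, the difference between the means is probably too small to draw a conclusion from.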
If you want to do more
If you have fair Java skills you could do any of the following
as your own experiment or in addition to it:
Write your own algorithm, weighting scheme, preprocessor, or whatever
you like and compare it to any of the available ones.
Investigate the quality increase between iterations in the K-Means algorithm.
doOneExperiment method in the Experimentator,
or start from the ExampleClusterer. "Steal" code from the Clusterer as well.
In the K-Means class I have temporarily put a method
that takes an IObjectGrouping and a SparseISimilarity
and returns a new grouping representing the state after one iteration.
Start from a random clustering, using the existing algorithm.
Save the result for each iteration in a separate directory.
Repeat the run, storing the results in the same directories,
one per iteration.
Build a result table or matrix using
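The idea of following the quality from one K-Means iteration to the next can be sketched outside Infomat like this. It is a toy illustration under stated assumptions: dense vectors instead of Infomat's sparse classes, cosine similarity, and hypothetical names (`one_iteration`, `quality`) that are not the methods in the K-Means class.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroids(vectors, assignment, k):
    """Mean vector per cluster (a zero vector if the cluster is empty)."""
    dim = len(vectors[0])
    result = []
    for j in range(k):
        members = [v for v, c in zip(vectors, assignment) if c == j]
        if members:
            result.append([sum(v[d] for v in members) / len(members) for d in range(dim)])
        else:
            result.append([0.0] * dim)
    return result


def one_iteration(vectors, assignment, k):
    """Recompute the centroids, then move every object to its most similar one."""
    cents = centroids(vectors, assignment, k)
    return [max(range(k), key=lambda j: cosine(v, cents[j])) for v in vectors]


def quality(vectors, assignment, k):
    """Internal quality: average similarity between objects and their centroid."""
    cents = centroids(vectors, assignment, k)
    return sum(cosine(v, cents[c]) for v, c in zip(vectors, assignment)) / len(vectors)


# Toy data: two obvious groups of term vectors (hypothetical values).
vectors = [[1, 0.1, 0], [0.9, 0.2, 0], [1, 0, 0.1], [0, 0.1, 1], [0.1, 0, 0.9], [0, 0.2, 1]]
k = 2
rng = random.Random(1)
assignment = [rng.randrange(k) for _ in vectors]  # start from a random clustering

history = [quality(vectors, assignment, k)]  # quality before the first iteration
for _ in range(10):
    new_assignment = one_iteration(vectors, assignment, k)
    if new_assignment == assignment:  # converged: no object moved
        break
    assignment = new_assignment
    history.append(quality(vectors, assignment, k))

for step, q in enumerate(history):
    print(f"iteration {step}: quality={q:.3f}")
```

In a real experiment each entry of `history` would correspond to one saved result directory, and the whole loop would be repeated from several random starting points.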
Write a report on your work. It should not be very long
but include tables and/or plots of your results.
(If you investigate a property like the number of clusters,
it is nice to give the results in a graph - quality as a function
of the number of clusters.)
Discuss the results of (both) the experiments:
- Are any of the differences in the results significant
(consider the standard deviation)?
- Why do you believe that certain parameters
improve/worsen results? (any problem with the data set?)
- In particular, why do you think the preprocessing
in the first experiment did not improve results?
- Discuss the appropriateness of using the
different measures to evaluate the result of
the parameters you investigate.
In particular, when is internal and external evaluation
All these questions are hard.
Do not be discouraged if you cannot answer them.
Give it a go!
Send your report to rosell at csc kth se.