2D5342, Databrytning / Knowledge Discovery and Data Mining, 4 credits. Pass/fail grading.


I have changed the lecture plan somewhat for the coming Fridays:

March 7: I will wrap up non-linear factor analysis (the Valpola-Karhunen paper) and explain the theory behind Support Vector Methods (VC dimension, PAC learning, kernels) as well as their practical significance.

March 14: Lars Forsberg talks (1 h) on Independent Component Analysis in brain imaging.
Jesper Fredriksson talks (1 h) on the analysis of functional brain images.


This course will be given in period 3, 2003, starting Jan 17, 2003. Fridays 13:15-15:00 in room 1537. Lectures will be in Swedish or English.

RECOMMENDED BOOKS:
One of the following is recommended as a useful reference and summary of the research area:
David J. Hand, Heikki Mannila and Padhraic Smyth: Principles of Data Mining, MIT Press, 2001.
Michael Berthold and David J. Hand (eds.): Intelligent Data Analysis: An Introduction, Springer-Verlag, 1999, ISBN 3-540-65808-4.
E.T. Jaynes: Probability Theory: The Logic of Science (G. L. Bretthorst, ed.), Cambridge University Press, 2003, ISBN 0-521-59271-2.
John Wang (ed.): Data Mining: Challenges and Opportunities, Idea Group Publishing, 2003.

Preliminary information:

Note: A project is not compulsory, but it is possible. The form of examination is negotiated between me and the students.

Lectures & Reading

First Lecture: Course overview and planning discussion. The lecture schedule will be defined based on the participants' interests. The following are among the topics that have been covered in previous instances of this course (not all in the same instance; no more than half of the topics can be conveniently covered in one course):
Bayesian and frequentist inference - relationships and philosophical issues
Overview of important statistical models
Multiple testing: FWE (familywise error), FDR (false discovery rate), specificity and power, ROC characterization
Visualization of non-geometric data
Bayesian view of exploratory data analysis and model adequacy
Graphical statistical models and Bayesian networks
Unsupervised classification and clustering through mixture models
Markov Chain Monte Carlo methods in Bayesian inference
Causality and confounding
Sequential Markov models (particle filters)
Finite Set Statistics
Support vector techniques: distribution-independent analysis
False Discovery Rate control and related issues (see the sketch after this list)
Independent Component Analysis - Bayesian interpretation
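
To make the multiple-testing and FDR items concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, assuming nothing beyond Python and NumPy (neither is otherwise required for the course); the p-values at the end are invented for illustration:

    import numpy as np

    def benjamini_hochberg(pvals, q=0.05):
        # Returns a boolean mask of hypotheses rejected at FDR level q.
        p = np.asarray(pvals, dtype=float)
        m = len(p)
        order = np.argsort(p)                          # rank p-values ascending
        below = p[order] <= q * np.arange(1, m + 1) / m
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = below.nonzero()[0].max()               # largest i with p_(i) <= q*i/m
            reject[order[:k + 1]] = True               # step-up: reject ranks 1..k
        return reject

    # Invented p-values, purely for illustration:
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6, 0.9]))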

Read more about the literature for the above topics on the previous course pages. The first part of the course material can be obtained from the
Course Package

Homeworks


Week 4 Reading: Jaynes Preface, Ch. 1, Ch. 2 (you may skip the detailed derivations of the product and sum rules; newer and better derivations exist), Ch. 4, Ch. 5. If you are really fascinated: the whole set of Jaynes notes is temporarily in Nada's directory /misc/tcs/datamining/Jayne. I will remove it when the book is published.
Jan 24 Lecture: Bayesianism, inference. Normative claims.

Week 5 Reading: A Survey of Bayesian Data Mining (Sections 1-3; skip what you have already read in Jaynes). Gelman's papers. If you have time and interest, take a look at the XGobi system, or Spotfire if you have a licence. (Spotfire is basically a commercialization of XGobi, but with a lot of functionality added.)
Jan 31 Lecture: Conjugate distributions, Beta and Dirichlet distributions, independence hypotheses. Validity of the arguments of Cox, de Finetti and Savage (Jaynes Ch. 2).
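
To preview the conjugacy idea, here is a minimal sketch of a Beta-Binomial update; the prior parameters and data are invented, and the Dirichlet-multinomial case generalizes the same count-adding pattern:

    # Beta(a, b) prior on a coin's success probability theta
    # (uniform prior -- an invented choice for illustration).
    a, b = 1.0, 1.0

    # Invented data: s successes in n trials.
    s, n = 7, 10

    # Conjugacy: the posterior is again a Beta distribution,
    #   theta | data ~ Beta(a + s, b + n - s),
    # so updating amounts to adding observed counts to the prior parameters.
    a_post, b_post = a + s, b + (n - s)
    print(f"posterior Beta({a_post:g}, {b_post:g}), mean {a_post / (a_post + b_post):.3f}")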

Week 6 Reading: MCMC: Bergman's thesis, Ch. 3.1-3 and Ch. 6. If you have time, take a look at BUGS or BASSIST. You can also start looking at homeworks 0 and 1.
Feb 7 Lecture: Rubin/Gelman ideas on integrating exploratory data analysis and testing into Bayesian analysis. Hierarchical models and shrinkage. MCMC theory and practice.
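
As a preview of the MCMC practice part, here is a minimal random-walk Metropolis sketch; the target density, step size and sample count are all invented illustrations, and a real analysis would add convergence diagnostics:

    import numpy as np

    rng = np.random.default_rng(0)

    def log_target(x):
        # Unnormalized log-density of the target; here a standard normal.
        return -0.5 * x * x

    def metropolis(n_samples, step=1.0, x0=0.0):
        x, out = x0, np.empty(n_samples)
        for i in range(n_samples):
            prop = x + step * rng.standard_normal()   # symmetric proposal
            # Accept with probability min(1, target(prop) / target(x)).
            if np.log(rng.uniform()) < log_target(prop) - log_target(x):
                x = prop
            out[i] = x                                # on rejection, repeat x
        return out

    samples = metropolis(20000)
    print(samples.mean(), samples.std())              # should approach 0 and 1
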
Week 7 Reading: The rest of the Survey of Bayesian Data Mining, particularly classification and graphical models. Freedman's papers.
Feb 14 Lecture: Theory and applications of classification by graphical models. Relationship to causality.
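
The simplest graphical-model classifier is naive Bayes: the class variable is the single parent of features that are conditionally independent given it. A minimal sketch on invented binary data (all numbers are illustrations):

    import numpy as np

    rng = np.random.default_rng(1)

    # Invented binary data: 3 binary features, 2 classes; the label depends
    # noisily on how many features are on.
    X = rng.integers(0, 2, size=(200, 3))
    y = (X.sum(axis=1) + rng.integers(0, 2, size=200) >= 2).astype(int)

    # Fit the model: class priors p(c) and per-class feature probabilities
    # p(x_j = 1 | c), with Laplace smoothing to avoid zero counts.
    prior = np.array([(y == c).mean() for c in (0, 1)])
    theta = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
                      for c in (0, 1)])

    def predict(x):
        # Maximize log p(c) + sum_j log p(x_j | c) over the class c.
        log_post = np.log(prior) + (x * np.log(theta)
                                    + (1 - x) * np.log(1 - theta)).sum(axis=1)
        return np.argmax(log_post)

    acc = np.mean([predict(x) == t for x, t in zip(X, y)])
    print("training accuracy:", acc)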


Week 8 Reading: Cheeseman & Stutz Autoclass paper. Time series (Valpola, Casdagli and Gershenfeld papers).

Feb 21 Lecture (NOTE: starts at 15:15!): Standard classification model. Time series and latent state identification.
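
For the standard mixture-model classification setting (the model behind Autoclass, although Autoclass itself integrates over parameters rather than maximizing them), here is a minimal EM sketch for a two-component one-dimensional Gaussian mixture; the synthetic data and all constants are invented:

    import numpy as np

    rng = np.random.default_rng(2)

    # Synthetic data from two Gaussians (invented parameters).
    x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

    def normal_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    # Initial guesses for weights, means and standard deviations.
    w = np.array([0.5, 0.5])
    mu = np.array([-1.0, 1.0])
    sigma = np.array([1.0, 1.0])

    for _ in range(50):
        # E-step: responsibility of each component for each data point.
        dens = w * np.stack([normal_pdf(x, m, s) for m, s in zip(mu, sigma)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the weighted data.
        n_k = r.sum(axis=0)
        w = n_k / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n_k
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)

    print(w, mu, sigma)   # should approach [0.4, 0.6], [-2, 3], [1, 1]
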
Feb 28 Lecture: Time series, continued. (Lecture at 13:15.)

Week 10 Reading: SVM papers.

March 7: I will wrap up non-linear factor analysis (the Valpola-Karhunen paper) and explain the theory behind Support Vector Methods (VC dimension, PAC learning, kernels) as well as their practical significance.
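
As a concrete companion to the SVM theory, here is a minimal sketch of the kernel trick using a kernel perceptron: it uses the same dual representation f(x) = sum_i alpha_i y_i k(x_i, x) as an SVM, but is trained by simple mistake-driven updates rather than margin maximization. The ring-shaped data and the RBF bandwidth are invented:

    import numpy as np

    rng = np.random.default_rng(3)

    def rbf_kernel(X, Y, gamma=1.0):
        # k(x, y) = exp(-gamma * ||x - y||^2)
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    # Toy data: two concentric rings, not linearly separable in the plane.
    n = 100
    radius = np.concatenate([rng.uniform(0, 1, n), rng.uniform(2, 3, n)])
    angle = rng.uniform(0, 2 * np.pi, 2 * n)
    X = np.c_[radius * np.cos(angle), radius * np.sin(angle)]
    y = np.concatenate([-np.ones(n), np.ones(n)])

    # Kernel perceptron: on each mistake, add the offending point
    # to the kernel expansion by incrementing its dual weight alpha_i.
    K = rbf_kernel(X, X)
    alpha = np.zeros(2 * n)
    for _ in range(20):                       # a few passes over the data
        for i in range(2 * n):
            if y[i] * ((alpha * y) @ K[:, i]) <= 0:
                alpha[i] += 1.0

    pred = np.sign((alpha * y) @ K)
    print("training accuracy:", (pred == y).mean())   # should be close to 1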

March 14: Lars Forsberg talks (1 h) on Independent Component Analysis in brain imaging.
Jesper Fredriksson talks (1 h) on the analysis of functional brain images.

March 21: TBA.

Some resources:

The Nada /misc directory has AFS address /afs/nada.kth.se/misc or /afs/world/nada.kth.se/misc