2D5342, Databrytning / Knowledge Discovery and Data Mining, 4 credits. Pass/fail grading.


Lectures & Reading

News Feb 11 2005:
There will be a repeat of the previous week's lecture outside my room at 9:15 on Fridays.
Reading assignment for Feb 4: Jaynes, preface, Ch. 1 and Ch. 2 (do not follow the 'proof' very closely -- it is the type of explanatory proof typical for theoretical physics), and Ch. 4 & 5. See the folder 'Course Package (kurspaket)'. Take a look at Exercises 2.1 and 2.2.
The whole set of Jaynes notes is temporarily in Nada's directory /misc/tcs/datamining/Jayne. It will be removed when the book is out.

Lecture on Friday Jan 28, 15:15. Topic: Principles of Bayesian Statistics: priors, likelihoods, posteriors, decision analysis and estimates, with examples (a small worked prior-to-posterior example follows the schedule below).
Friday Feb 4: Popular models: generalization of the Beta to the Dirichlet, graphical models and HMMs. Bayesian regression and feature selection.
Reading: Jaynes, Ch. 2.
Friday Feb 11: Foundations and alternatives to Bayesian inference. Is Cox's/Jaynes' argument valid? Coherence. Extended probability. Robust Bayes analysis, Dempster/Shafer theory and fuzzy sets.
Reading: Bergman's thesis, Ch. 3 and 6.
Friday Feb 18: Markov Chain Monte Carlo.
Friday Feb 25: No Lecture.
Friday March 4: Chapman-Kolmogorov equation and particle filters (sequential MCMC). Lecture starts at 15:30!
Friday March 11: Time series and latent states.
Morning lecture on Wednesday March 16.

Friday March 18: Support vectors and kernel methodology.

Future schedule will be announced here later.
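
A minimal worked example of the prior/likelihood/posterior machinery (the numbers and notation are illustrative, not taken from the lecture notes): with a Beta(a, b) prior on a Bernoulli parameter \theta and s successes observed in n trials, Bayes' rule multiplies likelihood and prior,

    p(\theta \mid s, n) \propto \theta^{s}(1-\theta)^{n-s} \cdot \theta^{a-1}(1-\theta)^{b-1} = \theta^{a+s-1}(1-\theta)^{b+n-s-1},

so the posterior is again a Beta, Beta(a+s, b+n-s), with posterior mean (a+s)/(a+b+n). The Dirichlet plays the same conjugate role for multinomial counts, which is the generalization taken up on Feb 4.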

COURSE STYLE

Two-hour lectures are given on Fridays, in the afternoon. The language is English or Swedish, depending on the participants. Students are expected to improve their understanding of the topics by reading recommended survey and research papers, and by solving at least some weekly bookwork/rider problems. Topics are agreed on during the first lecture, and students are encouraged to present material they master well in class. Examination consists of the following parts (the mix is negotiated; only the first part is required):
Discussion of a submitted list of studied papers, not necessarily only among those recommended.
Discussion of solutions to weekly problems.
Discussion of solutions to open-ended homeworks.
Discussion of a project involving the methods studied and (preferably) related to the student's research topic.

RECOMMENDED BOOKS:
One of the following is recommended as a useful reference and summary of the research area:

"Principles of Data Mining", MIT Press 2001, by David J. Hand, Heikki Mannila and Padhriac Smyth.

Michael Berthold and David J. Hand (eds.): Intelligent Data Analysis: An Introduction, Springer-Verlag, 1999, ISBN 3-540-65808-4.

E.T. Jaynes: Probability Theory: The Logic of Science (L. Bretthorst, ed.), Cambridge University Press, 2003, ISBN 0521592712.
(fascinating, sometimes controversial)

A. Gelman, J. Carlin, H. Stern and D. Rubin: Bayesian Data Analysis, 2nd ed., Chapman & Hall, 2003.
(very comprehensive, both practice and theory)

Bernardo and Smith: Bayesian Theory, John Wiley & Sons, 1994, ISBN 0-471-92416-4.
(Fairly advanced/theoretical)

John Wang (ed.): Data Mining: Challenges and Opportunities, Idea Group Publishing, 2003, ISBN 1591400511.
(Your lecture notes are an expanded version of Ch. 1 of this book.)

First Lecture: Course overview, presentation of participants, planning discussion. The lecture schedule will be defined based on the participants' interests. The following are among the topics that have been covered in this course previously (not all in the same course instance; no more than half the topics can be conveniently covered in one course):
Bayesian and frequentist inference - relationships and philosophical issues
Overview of important statistical models
Multiple testing: FWER, FDR, specificity and power, ROC characterization
Model and predictor selection as a decision problem: loss functions, cross validation (Bernardo/Smith Ch. 6)
Bayesian view of exploratory data analysis and model adequacy (Gelman et al.)
The overfitting problem and its appearances in Bayesian statistics.
Graphical statistical models and Bayesian networks
Unsupervised classification and clustering through mixture models
Markov Chain Monte Carlo methods in Bayesian inference (a small sampler sketch follows this list)
Exchangeability, Bayesian regression approaches, hierarchical models.
Sequential Markov models (particle filters)
Robust Bayes, Dempster/Shafer and soft computing
Support vector techniques: distribution-independent analysis
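
A minimal sketch of the Metropolis idea behind MCMC (plain Python; the helper names, step size and example data are illustrative, not course code): a random-walk sampler drawing from the Beta posterior of the coin example above, so the result can be checked against the exact posterior mean.

    import math, random

    def log_post(theta, s=7, n=10, a=1.0, b=1.0):
        # Unnormalized log posterior for a Bernoulli parameter theta
        # with a Beta(a, b) prior and s successes in n trials (illustrative numbers).
        if not 0.0 < theta < 1.0:
            return float("-inf")
        return (a + s - 1) * math.log(theta) + (b + n - s - 1) * math.log(1.0 - theta)

    def metropolis(n_samples=20000, step=0.1, start=0.5):
        # Random-walk Metropolis: propose theta' = theta + Normal(0, step),
        # accept with probability min(1, p(theta') / p(theta)).
        theta, lp = start, log_post(start)
        samples = []
        for _ in range(n_samples):
            prop = theta + random.gauss(0.0, step)
            lp_prop = log_post(prop)
            if random.random() < math.exp(min(0.0, lp_prop - lp)):
                theta, lp = prop, lp_prop
            samples.append(theta)
        return samples

    draws = metropolis()[5000:]       # discard burn-in
    print(sum(draws) / len(draws))    # close to the exact posterior mean 8/12 = 0.667

This accept/reject step is the core of random-walk Metropolis; practical MCMC adds problem-specific proposals and convergence diagnostics, and is useful precisely when the posterior has no closed form.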

Read more on the literature for the above topics on previous course pages. The first part of the course material can be obtained from the
Course Package

Home Works

Exercise problems

Some resources:

The Nada /misc directory has the AFS address /afs/nada.kth.se/misc or /afs/world/nada.kth.se/misc.