2D5342, Knowledge Discovery and Data Mining (Databrytning), 4 credits. Ungraded (pass/fail).

The lecture on April 7 is cancelled and replaced by a CAS seminar on a large robotics project.

Lectures are in room 1537, floor 5, Fridays at 13:15.

Rehearsal/question sessions outside my office, floor 4, Fridays at 9:15.

Tentative Lecture Schedule

Next Lecture:
Feb 27: Finish Ch. 2 (excluding 2.1.9)
Start HW 1
Jaynes Ch 2 if you are interested
Matlab MCMC code for piecewise constant intensity

Feb 10, 13:15 in 1537. SA Lecture Notes up to (not including) section 2.4; Exercises 1, 3, 4 and 5. You may also start on Homework 0.

Feb 3, Friday, 13:15 in 1535. Recommended reading: SA Lecture Notes up to and including section 2.1.2; Exercise 2.

Jaynes Lecture Notes: Ch1, Ch 4, Ch 5.

For the spring 2006 course round I will use a substantially expanded version of the course compendium 'A survey of Bayesian Data Mining', which has been renamed 'Statistical Methods in Applied Computer Science'. The expanded version includes descriptions of the techniques that were previously covered only through handed-out articles - which many found hard to read - and lectures. This should make it easier to navigate the various methods covered, especially since both an index and a table of contents have been added.

The new compendium is in preparation and will be posted on the course pages in early December. The old compendium has been removed for now (though it is easy to find with Google), so that no one prints an outdated compendium by mistake. In 2007 the course will run in a new form, and the preliminary course goals can be found here.


First Lecture on Friday, Jan 27 2006, 15:15, in room 1535 (opposite 1537). Nada floor 5, Osquars Backe 2.
Course overview and administration. Schedule check: there will probably be two lectures every week, where the first lecture of each week is a rehearsal of the previous week's last lecture, with more problem solving. Presentation of participants.

Future schedule will be announced here later.


Two-hour lectures are given on Fridays, after noon. The language is English or Swedish, depending on the participants. Students are expected to improve their understanding of the topics by reading recommended survey and research papers, and by solving at least some of the weekly bookwork/rider problems. Topics are agreed on during the first lecture, and students are encouraged to present material they know well in class. Examination consists of the following parts (the mix is negotiated; only the first part is required):
Discussion of a submitted list of studied papers, not necessarily only among those recommended.
Discussion of solutions to weekly problems.
Discussion of solutions to open-ended homeworks.
Discussion of a project involving methods studied and (preferably) related to the student's research topic.

One of the following is recommended as a useful reference and summary of the research area:

"Principles of Data Mining", MIT Press 2001, by David J. Hand, Heikki Mannila and Padhriac Smyth.

Michael Berthold and David J. Hand (eds.): Intelligent Data Analysis: An Introduction, Springer-Verlag, 1999, ISBN 3-540-65808-4.

Pang-Ning Tan, Michael Steinbach and Vipin Kumar: Introduction to Data Mining, Addison Wesley, April 2005.

E.T. Jaynes: Probability Theory: The Logic of Science (G.L. Bretthorst, ed.), Cambridge University Press, 2003, ISBN 0521592712.
(fascinating, sometimes controversial)

A. Gelman, J. Carlin, H. Stern and D. Rubin: Bayesian Data Analysis, Chapman & Hall, 2003 (Second Edition).
(very comprehensive, both practice and theory)

Bernardo and Smith: Bayesian Theory, John Wiley & Sons, 1994, ISBN 0-471-92416-4.
(Fairly advanced/theoretical)

John Wang (ed.): Data Mining: Challenges and Opportunities, Idea Group Publishing, 2003, ISBN 1591400511.
(Your lecture notes are an expanded version of Ch. 1 of this.)

First Lecture: Course overview, presentation of participants, planning discussion. The lecture schedule will be defined based on participants' interests. The following are among topics that have been covered in this course previously (not all in the same course instance; no more than half the topics can be conveniently covered in one course):
Bayesian and frequentist inference - relationships and philosophical issues
Overview of important statistical models
Multiple testing: FWER, FDR, specificity and power, ROC characterization
Model and predictor selection as decision problem: Loss functions, cross validation (Bernardo/Smith Ch6)
Bayesian view of exploratory data analysis and model adequacy (Gelman et al.)
The overfitting problem and its appearances in Bayesian statistics.
Graphical statistical models and Bayesian networks
Unsupervised classification and clustering through mixture models
Markov Chain Monte Carlo methods in Bayesian inference
Exchangeability, Bayesian regression approaches, hierarchical models.
Sequential Markov models (particle filters)
Robust Bayes, Dempster/Shafer and soft computing
Support vector techniques: distribution-independent analysis
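As an illustration of one of the topics above (Markov Chain Monte Carlo methods in Bayesian inference), here is a minimal random-walk Metropolis sampler. This is only a sketch for orientation, not the course's Matlab code; the toy model (Gaussian mean with known unit variance and a flat prior) and all names in it are my own choices for the example.

```python
import math
import random

def metropolis(log_post, x0, n_samples=5000, step=1.0, seed=1):
    """Random-walk Metropolis sampler for a one-dimensional posterior.

    log_post : function giving the log posterior density (up to a constant).
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, p(proposal) / p(x)),
        # done on the log scale for numerical stability.
        if math.log(rng.random()) < log_post(proposal) - log_post(x):
            x = proposal
        samples.append(x)
    return samples

# Toy posterior: mean mu of a Gaussian with known variance 1,
# flat prior, given five observations.
data = [1.2, 0.7, 1.5, 0.9, 1.1]

def log_post(mu):
    return -0.5 * sum((y - mu) ** 2 for y in data)

samples = metropolis(log_post, x0=0.0)
burn = samples[2000:]          # discard burn-in
estimate = sum(burn) / len(burn)
print(estimate)                # close to the sample mean, 1.08
```

With a flat prior the posterior mean equals the sample mean, so the Monte Carlo estimate should land near 1.08; the spread of the retained samples approximates the posterior standard deviation 1/sqrt(5).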

Read more on the literature for the above topics on previous years' course pages. The first part of the course material can be obtained from
Course Package

Home Works

Some resources:

The Nada /misc directory has AFS address /afs/nada.kth.se/misc or /afs/world/nada.kth.se/misc