Course 2D5342 Outline 1997

First lecture: August 22, 13:15 - 15:00 in room 1537, Nada, Osquars Backe 2.

Course Book (Comprehensive and excellent!!)
Order from MIT Press

Course mailing list

The lectures will be of two kinds. Foundational areas, such as the basics of Bayesian analysis, unsupervised classification, MDL, and soft computing, will be covered in seminar lectures by me and invited speakers. Applications in areas of interest to students will be presented by students and/or invited speakers and discussed in class. There will be less emphasis on a cookbook of concrete computational procedures: once the basic theory and a range of application areas are understood, the specific procedures required are easy to look up.

Literature

Tentative reading list:

A: General Methodology:

  1. In Fayyad et al: Ch 1, 2, 4, 14, 23
  2. H. Mannila: Methods and problems in Data Mining,
    ICDT 97, LNCS 1186, pp 41-55
  3. T. R. Willemain, "Model Formulation: What experts think about and when," Operations Research, Vol. 43, no. 6, pp.916-932, Nov/Dec. (1995).

B: Bayesian methods and Markov Chain Monte Carlo:

  1. In Fayyad et al: Ch 3, 11, 13, 20
  2. E. T. Jaynes: Probability theory: The logic of Science, Ch 1&2.
  3. D.S. Sivia: Ch 2: Parameter estimation and Ch 6: Non-parametric methods, in: Data Analysis: A Bayesian Tutorial, Clarendon Press, Oxford, 1996.
  4. Hastings: Monte Carlo sampling methods using Markov chains, and their applications. Biometrika 57(1970) pp 97-109
  5. R.M.Neal: Probabilistic Inference using Markov Chain Monte Carlo Methods. TR CRG-TR-93-1, University of Toronto, CS Department
  6. Bibliography from B 5.

C: Soft Computing:

  1. Azvine, Azarmi and Tsui: An Introduction to Soft Computing - A Tool for building Intelligent Systems LNCS1198, 1997, pp 191-210.
  2. Baldwin, Martin: Basic concepts of a Fuzzy Logic Data Browser with applications. Software Agents and Soft Computing, LNCS1198, 1997, pp 211-241.
  3. Xiaohua Hu, Nick Cercone: Mining Knowledge Rules from Databases: A Rough Set Approach. IEEE 1996 Data Engineering Conference, pp 96-105.
  4. SummarySQL - A Fuzzy Tool For Data Mining
    Dan Rasmussen, Ronald R. Yager, Intelligent Data Analysis, 1(1)(1997)
  5. Heinonen, Mannila: Attribute oriented induction and conceptual clustering,
    University of Helsinki Dept Computer Science, report C-1996-2.

D: Stochastic Complexity and Classification (unsupervised and supervised)

  1. In Fayyad et al: Ch 6, 7, 19
  2. J. Rissanen: Stochastic Complexity (with discussion), J. R. Statist. Soc. B (1987) 49(3) pp 223-239 and 252-265.
  3. C.S. Wallace and P.R. Freeman: Estimation and Inference by Compact Coding (with discussion). J. R. Statist. Soc. B (1987) 49(3) pp 240-265.
  4. Cullen Schaffer: Selecting a Classification Method by Cross-Validation (MLJ 1993)
  5. G.I. Webb: Further Experimental Evidence against the Utility of Occam's Razor,
    Journal of AI Research 4(1996) 397-417.
  6. S.P. Curram & J. Mingers. "Neural Networks, Decision Tree Induction and Discriminant Analysis: An Empirical Comparison." Journal of the Operational Research Society. 45(4) 1994 pp 440-450.
  7. Mats Gyllenberg, Timo Koski and Martin Verlaan: Classification of binary vectors by stochastic complexity. J Multivariate Analysis 62(1997)
  8. H.G. Gyllenberg, M. Gyllenberg, T. Koski, T Lund: Stochastic complexity as a taxonomic tool. TRITA-MAT-97-MS-02, KTH.
  9. M. Gyllenberg, T. Koski, T. Lahti: Associative memories for clusters of binary vectors using MATLAB neural network toolbox. Proc of the Nordic MATLAB conference.

E: Time series and prediction:

  1. In Fayyad et al: Ch 9, 22
  2. Casdagli, Des Jardins, Eubank, Farmer, Gibson, Hunter, Theiler: Nonlinear modeling of Chaotic Time Series: Theory and Applications.
  3. (to read before Casdagli et al): Ch 1 of: Time Series Prediction: Forecasting the Future and Understanding the Past. Weigend, A. S., and N. A. Gershenfeld (Eds.) (1994) Santa Fe Institute Studies in the Sciences of Complexity XV. (Proceedings of the NATO Advanced Research Workshop on Comparative Time Series Analysis, Santa Fe, NM, May 1992.) Reading, MA: Addison-Wesley.

F: Spatial applications:

  1. "Fast Spatio-Temporal Data Mining of Large Geophysical Datasets", The First International Conference on Knowledge Discovery and Data Mining, Montreal, Quebec, Canada, Aug 1995.
  2. Bettini, Wang, Jajodia: Testing Complex Relationships Involving Multiple Granularities and its Application to Data Mining. PODS 96
  3. Wang, Chirn, Marr, Shapiro, Shasha, Zhang: Combinatorial Pattern Discovery for Scientific Data: Some preliminary results SIGMOD 94 115-125.
  4. Li, Yu, Castelli: Hierarchyscan: A hierarchical Similarity Search Algorithm for Databases of Long Sequences IEEE 96 Data Engineering
  5. Gray, Bosworth, Layman, Pirahesh: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-totals. IEEE 96 Data Engineering, pp 152-159

G: Visualization:

  1. Abram, Treinish: An extended data-flow architecture for data analysis and visualization. IEEE 95 Visualization, pp 263-270.
  2. Martin, Ward: High Dimensional Brushing for Interactive Exploration of Multivariate Data. IEEE 95 Visualization, pp 271-278.
  3. Buja et al (1996). Interactive High-Dimensional Data Visualization. Journal of Computational and Graphical Statistics. Vol 5, No. 1.

H: Mediation and Brokerage:

  1. Calmet, Debertin, Jekutsch, Schu: An executable graphical representation of mediatory information systems. IEEE 96 Data Engineering, pp 124-131.
  2. Papakonstantinou, Garcia-Molina, Ullman: MedMaker: A mediation system based on declarative specifications. IEEE 96 Data Engineering, pp 132-141.
  3. "Integrating Distributed Object Management into EOS", Geo Info Systems, 5(5):58-59, May 1995.
  4. "The Conquest Modeling Framework for Geoscientific Data", UCLA CSD Technical Report #940039, Oct 1994.

Optional reading:

Schedule: see the news file.

Examination.

Participating students can choose from the reading list and define an individual mix of the following examination forms, depending on individual preferences and learning needs, up to a total that the examiner assesses as 4 credits (poäng). One project, possibly a small one, must be included.

The project could involve data used in your research project (check with your project leader).

WWW pointers

IDA journal
Irvine ML repository
Statistics (CMU) Home Page
Data Mining Foundation
Helsinki Data Mining Group
DASL Case repository
UCLA KDD group
NASA report and program archives
Earth Science Resources
Datasets for Data Mining
Order map data
SCB's data online
SKICAT
Geospatial Data Resources
Some US GIS data
Astrophysics Data Sets

Bayesian Inference Demo D-SIDE
IVEE
WinViz visualization system
XmdvTool
MacFactory

BUGS archive

Related Courses Elsewhere

KDDM at RPI
Spatial Statistics at Wisconsin
Spatial Statistics
Statistics Refresher

Haiku