Course Book (Comprehensive and excellent!!)

Order from MIT Press

The lectures will be of two kinds. Foundational areas, such as the basis of Bayesian analysis, unsupervised classification, MDL, and soft computing, will be covered in seminar lectures by me and invited speakers. Applications in areas of interest to students will be presented by students and/or invited speakers and discussed in class. There will be less emphasis on a cookbook of concrete computational procedures: once the basic theory and a range of application areas are understood, the concrete computational procedures required are easy to look up.

- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.): Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA, 1996. ISBN 0-262-56097-6

- IEEE Expert Oct 1996
- Research articles - see below

**A: General Methodology:**

- In Fayyad et al: Ch 1, 2, 4, 14, 23
- H. Mannila: Methods and problems in Data Mining, ICDT 97, LNCS 1186, pp 41-55
- T. R. Willemain, "Model Formulation: What experts think about and when," Operations Research, Vol. 43, no. 6, pp. 916-932, Nov/Dec. (1995).

**B: Bayesian methods and Markov Chain Monte Carlo:**

- In Fayyad et al: Ch 3, 11, 13, 20
- E. T. Jaynes: Probability theory: The logic of Science, Ch 1&2.
- D.S. Sivia: Ch 2: Parameter estimation and Ch 6: Non-parametric methods, in: Data Analysis, A Bayesian Tutorial. Clarendon Press, Oxford, 1996
- Hastings: Monte Carlo sampling methods using Markov chains, and their applications. Biometrika 57(1970) p 97-109
- R.M.Neal: Probabilistic Inference using Markov Chain Monte Carlo Methods. TR CRG-TR-93-1, University of Toronto, CS Department
- Bibliography from B 5.

**C: Soft Computing:**

- Azvine, Azarmi and Tsui: An Introduction to Soft Computing - A Tool for building Intelligent Systems LNCS1198, 1997, pp 191-210.
- Baldwin, Martin: Basic concepts of a Fuzzy Logic Data Browser with applications. Software Agents and Soft Computing, LNCS1198, 1997, pp 211-241.
- Xiaohua Hu, Nick Cerone: Mining Knowledge Rules from Databases: A Rough Set Approach IEEE 1996 Data Engineering Conference, pp 96-105.
- SummarySQL - A Fuzzy Tool For Data Mining, Dan Rasmussen, Ronald R. Yager, Intelligent Data Analysis, 1(1)(1997)
- Heinonen, Mannila: Attribute oriented induction and conceptual clustering, University of Helsinki Dept Computer Science, report C-1996-2.

**D: Stochastic Complexity and Classification (unsupervised and supervised):**

- In Fayyad et al: Ch 6, 7, 19
- J. Rissanen: Stochastic Complexity(with discussion), J.R. Statist. Soc B(1987) 49(3) pp 223-239 and 252-265.
- C.S. Wallace and P.R. Freeman: Estimation and inference by Compact Coding(with discussion). J.R. Statist. Soc B(1987) 49(3) pp 240-265.
- Cullen Schaffer: Selecting a Classification Method by Cross-Validation (MLJ 1993)
- G.I. Webb: Further Experimental Evidence against the Utility of Occam's Razor, Journal of AI research 4(1996) 397-417.
- S.P. Curram & J. Mingers. "Neural Networks, Decision Tree Induction and Discriminant Analysis: An Empirical Comparison." Journal of the Operational Research Society. 45(4) 1994 pp 440-450.
- Mats Gyllenberg, Timo Koski and Martin Verlaan: Classification of binary vectors by stochastic complexity. J Multivariate Analysis 62(1997)
- H.G. Gyllenberg, M. Gyllenberg, T. Koski, T Lund: Stochastic complexity as a taxonomic tool. TRITA-MAT-97-MS-02, KTH.
- M. Gyllenberg, T. Koski, T. Lahti: Associative memories for clusters of binary vectors using MATLAB neural network toolbox. Proc of the Nordic MATLAB conference.

**E: Time series and prediction:**

- In Fayyad et al: Ch 9, 22
- Casdagli, Des Jardins, Eubank, Farmer, Gibson, Hunter, Theiler: Nonlinear modeling of Chaotic Time Series: Theory and Applications.
- (to read before Casdagli et al): Ch 1 of: Time Series Prediction: Forecasting the Future and Understanding the Past. Weigend, A. S., and N. A. Gershenfeld (Eds.) (1994) Santa Fe Institute Studies in the Sciences of Complexity XV. (Proceedings of the NATO Advanced Research Workshop on Comparative Time Series Analysis, Santa Fe, NM, May 1992.) Reading, MA: Addison-Wesley.

**F: Spatial applications:**

- "Fast Spatio-Temporal Data Mining of Large Geophysical Datasets", The First International Conference on Knowledge Discovery and Data Mining, Montreal, Quebec, Canada, Aug 1995.
- Bettini, Wang, Jajodia: Testing Complex Relationships Involving Multiple Granularities and its Application to Data Mining. PODS 96
- Wang, Chirn, Marr, Shapiro, Shasha, Zhang: Combinatorial Pattern Discovery for Scientific Data: Some preliminary results SIGMOD 94 115-125.
- Li, Yu, Castelli: Hierarchyscan: A hierarchical Similarity Search Algorithm for Databases of Long Sequences IEEE 96 Data Engineering
- Gray, Bosworth, Layman, Pirahesh: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-totals. IEEE 96 Data Engineering, pp 152-159

**G: Visualization:**

- Abram, Treinish: An extended data-flow architecture for data analysis and visualization. IEEE 95 Visualization, pp 263-270.
- Martin, Ward: High Dimensional Brushing for Interactive Exploration of Multivariate Data. IEEE 95 Visualization, pp 271-278.
- Buja et al (1996). Interactive High-Dimensional Data Visualization. Journal of Computational and Graphical Statistics. Vol 5, No. 1.

**H: Mediation and Brokerage:**

- Calmet, Debertin, Jekutsch, Schu: An executable graphical representation of mediatory information systems. IEEE 96 Data Engineering, pp 124-131.
- Papakonstantinou, Garcia-Molina, Ullman: MedMaker: A mediation system based on declarative specifications. IEEE 96 Data Engineering, pp 132-141.
- "Integrating Distributed Object Management into EOS", Geo Info Systems, 5(5):58-59, May 1995.
- "The Conquest Modeling Framework for Geoscientific Data", UCLA CSD Technical Report #940039, Oct 1994.

Participating students can choose from the reading list and define an individual mix of the following examination forms, depending on individual preferences and learning needs, up to a total that the examiner assesses as 4 credits (poäng). One project, possibly a small one, must be included.

- Presentations at Seminars,
- Homework,
- Project.

The project could involve data used in your research project (check with your project leader).

IDA journal

Irvine ML repository

Statistics (CMU) Home Page

Data Mining Foundation

Helsinki Data Mining Group

DASL Case repository

UCLA KDD group

NASA report and program archives

Earth Science Resources

Datasets for Data Mining

Order map data

SCBs data online

SKICAT

Geospatial Data Resources

Some US GIS data

Astrophysics Data Sets

Bayesian Inference Demo D-SIDE

IVEE

WinViz visualization system

XmdvTool

MacFactory

BUGS archive

KDDM at RPI

Spatial Statistics at Wisconsin

Spatial Statistics

Statistics Refresher