This project applies classical statistical methods of multivariate data analysis to the problems of feature selection and the prediction of feature properties. Because neural networks can also be applied to these problems, the purpose of using classical statistical methods is to allow a comparison between the results obtained with statistics and those obtained with neural networks. The work has been carried out on real data concerning pulp and paper characteristics. The data were obtained from the Swedish Pulp and Paper Research Institute (STFI), and the whole work has been done in cooperation with this institute. The goal has been to predict paper features (called response variables) from pulp features (called explanatory variables) using as few pulp characteristics as possible. The problem is not trivial because the explanatory variables are correlated among themselves, as are the response variables. We have also observed a very strong correlation between these two sets of variables.
One technique used to reduce the number of explanatory variables is Principal Component Analysis (PCA). PCA transforms a set of correlated variables into a new set of uncorrelated variables, each of which is a linear combination of the original variables. PCA lets us work with a smaller number of variables and see how many clusters (groups) can be obtained from the original variables. A second technique is a method of discarding redundant original variables, which reduces the number of original variables themselves. This is a meaningful advantage over PCA, which works with a smaller number of derived variables but still requires all of the original variables to be measured. A third technique is based on a multiple regression model, which is commonly used to investigate the relationship between two sets of variables. Regression analysis is primarily concerned with predicting the mean value of the response variable on the basis of known values of the explanatory variables. Multiple regression, complemented by the analysis of residuals, is a very common method for predicting response variables.
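The PCA transformation described above can be illustrated with a minimal sketch. It assumes NumPy and uses synthetic data, not the STFI pulp data; the function name `pca` and the choice to work from the correlation matrix (that is, on standardized variables) are assumptions made for the illustration and are not taken from the project itself.

```python
import numpy as np

def pca(X):
    """PCA on the correlation matrix of X (rows = samples, columns = variables).

    Returns the eigenvalues in descending order, the matching eigenvectors
    (one principal component per column), and the component scores.
    """
    # Standardize each variable to zero mean and unit sample variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(Z, rowvar=False)
    # eigh is appropriate because R is symmetric; it returns ascending order.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Each score column is a linear combination of the standardized variables.
    scores = Z @ eigvecs
    return eigvals, eigvecs, scores

# Synthetic example: 5 correlated variables driven by 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

eigvals, loadings, scores = pca(X)
explained = eigvals / eigvals.sum()          # share of variance per component
score_cov = np.cov(scores, rowvar=False)     # diagonal: scores are uncorrelated
print(np.round(explained, 3))
```

The covariance matrix of the scores is diagonal, which is the sense in which the new variables are uncorrelated. Note that every score still depends on all five original columns, so all of them would still have to be measured.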
We have applied these techniques to a data set consisting of 10 pulp features and 12 paper characteristics. Most of the paper characteristics can be predicted from only 3 pulp features using a multiple regression model.
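As a rough illustration of a multiple regression model with three explanatory variables, followed by a look at the residuals, consider the sketch below. It assumes NumPy and ordinary least squares; the data are synthetic, and the coefficients, the noise level, and the helper name `fit_multiple_regression` are hypothetical and do not come from the STFI data set.

```python
import numpy as np

def fit_multiple_regression(X, y):
    """Ordinary least squares fit of y on the columns of X plus an intercept.

    Returns the coefficients (intercept first), the residuals, and R^2.
    """
    A = np.column_stack([np.ones(len(X)), X])      # design matrix
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return coef, residuals, 1.0 - ss_res / ss_tot

# Synthetic stand-in: one response predicted from 3 explanatory variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_coef = np.array([2.0, 0.5, -1.0, 0.3])        # hypothetical values
y = true_coef[0] + X @ true_coef[1:] + 0.05 * rng.normal(size=100)

coef, residuals, r2 = fit_multiple_regression(X, y)
print(np.round(coef, 2), round(r2, 3))
```

In practice the residuals would also be plotted against the fitted values and against each explanatory variable to check for structure that the linear model has missed.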