An Inductive Data Mining System Framework. Godfrey Onwubolu

Abstract. This paper presents a framework for a unified inductive data mining system based on the group method of data handling (GMDH) for modeling, prediction, clustering, and classification when mining the fuzzy, noisy, and large datasets encountered in real-life applications. Major issues that arise in data mining of real-life problems include missing data and a very large number of variables defining the data. To handle missing and noisy data, a fast Fourier transform (FFT) signature-based approach integrated with expectation maximization-principal component analysis (EM-PCA) is proposed. It automatically reduces a dataset with many variables to a smaller dimension, and consequently results in a more flexible and responsive data mining system for dealing with practical real-life problems. The resulting unified inductive data mining system performs these data mining functions and differs from existing well-known deductive modeling schemes.

Keywords. Inductive modeling, GMDH, data mining, FFT-EM-PCA signature-reduction algorithm, complex systems.


