A Robust Scalable Model Using Hybrid Approach for the Detection of the Projected Outliers

H. Sadawarti; G.S. Kalra; Kamal Malik

doi:10.19026/rjaset.12.2712

Abstract

Abnormal and anomalous observations remain a serious problem for the industries they affect, even in a technologically advanced era. To reduce or eliminate outliers in massive data streams, they must first be identified accurately in high-dimensional data, which is itself challenging. In this study, a scalable outlier detection model is proposed that is robust enough to detect projected outliers, that is, outliers that lie in lower-dimensional subspaces. The model addresses the curse of dimensionality, which arises frequently in large data streams and massive datasets. Rapid distance-based and density-based approaches are applied first, and the probability density is then estimated with a Gaussian Mixture Model (GMM). Bayes' probability is applied to the final observations to confirm them as projected outliers.
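The pipeline outlined in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes axis-aligned 2-D subspaces, uses only a k-nearest-neighbour distance score (no separate density-based stage), fits a two-component 1-D Gaussian mixture to the scores by EM, and reads the Bayes posterior of the higher-mean component as the outlier probability. All function names, parameters, and the synthetic data are hypothetical.

```python
from itertools import combinations

import numpy as np


def knn_score(X, k=5):
    """Distance-based score: mean distance to the k nearest neighbours."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Column 0 after sorting is the zero distance of each point to itself.
    return np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)


def projected_scores(X, k=5, dim=2):
    """Largest standardized k-NN score over all axis-aligned subspaces."""
    n, d = X.shape
    best = np.full(n, -np.inf)
    for cols in combinations(range(d), dim):
        s = knn_score(X[:, list(cols)], k)
        z = (s - s.mean()) / (s.std() + 1e-12)
        best = np.maximum(best, z)
    return best


def log_weighted_gauss(s, pi, mu, var):
    """log(pi_j * N(s_i | mu_j, var_j)) for every point i and component j."""
    return (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
            - (s[:, None] - mu) ** 2 / (2 * var))


def posteriors(s, pi, mu, var):
    """Bayes' rule in log space: P(component j | s_i)."""
    ll = log_weighted_gauss(s, pi, mu, var)
    return np.exp(ll - np.logaddexp(ll[:, 0], ll[:, 1])[:, None])


def fit_gmm_1d(s, n_iter=50, var_floor=1e-6):
    """EM for a two-component 1-D Gaussian mixture over the scores."""
    mu = np.array([np.percentile(s, 25), s.max()])  # assumed initialization
    var = np.full(2, s.var() + var_floor)
    pi = np.array([0.9, 0.1])
    for _ in range(n_iter):
        resp = posteriors(s, pi, mu, var)            # E-step
        nk = resp.sum(axis=0)                        # M-step
        mu = (resp * s[:, None]).sum(axis=0) / nk
        var = (resp * (s[:, None] - mu) ** 2).sum(axis=0) / nk + var_floor
        pi = nk / len(s)
    return pi, mu, var


# Synthetic data (an assumption for illustration): columns 0 and 1 are
# strongly correlated, the remaining columns are uniform noise.
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(0.0, 1.0, size=(n, 5))
X[:, 1] = X[:, 0] + rng.normal(0.0, 0.01, size=n)
# One point that is unremarkable in each single column but breaks the
# correlation in the subspace (0, 1).
X = np.vstack([X, [0.1, 0.9, 0.5, 0.5, 0.5]])
outlier_idx = n

scores = projected_scores(X)
pi, mu, var = fit_gmm_1d(scores)
# Posterior of the higher-mean ("outlier") component for every point.
posterior = posteriors(scores, pi, mu, var)[:, np.argmax(mu)]
```

Enumerating every 2-D subspace costs a number of passes that grows quadratically with the number of attributes, so this sketch only illustrates the idea and does not reflect the scalability claimed for the proposed model.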

Keywords:

GMM, KDD, projected outliers


Research Journal of Applied Sciences, Engineering and Technology
