Research Article | OPEN ACCESS
Distributed Anomaly Detection Over Big Data
Mohamed Sakr, Walid Atwa and Arabi Keshk
Faculty of Computers and Information, Shebeen El Kom, Menofia, 32511, Egypt
Research Journal of Applied Sciences, Engineering and Technology 2019 2:77-87
Received: December 27, 2018 | Accepted: February 17, 2019 | Published: March 15, 2019
Abstract
This study addresses the problem of detecting anomalies in big data. A Border-based Grid Partition (BGP) algorithm was proposed for calculating the Local Outlier Factor (LOF) of big data in a distributed environment. It splits the data into intersecting subsets and allocates these subsets to the slave nodes, with some parts of the subsets replicated between slave nodes. Each slave node calculates the LOF for the subsets that it owns. Because BGP splits the data on a grid without considering the size of the data assigned to each slave node, it results in an unbalanced distribution of the subsets among the slave nodes. To overcome this problem, we propose a modification of the BGP algorithm that takes into consideration the size of the data assigned to every slave node. The modified algorithm, called the Balanced Border-based Grid Partition (BBGP) algorithm, splits the data equally among the slave nodes so that all slave nodes perform a balanced share of the processing when calculating the LOF. Finally, we evaluate the performance of the two algorithms through a series of simulation experiments over real data sets.
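The difference between a fixed grid split and a size-aware split can be illustrated with a minimal Python sketch. It is not the BGP or BBGP algorithm itself: the abstract does not specify how cells or borders are chosen, so the one-dimensional split, the quantile cut points, the `margin` parameter and all function names below are assumptions made purely for illustration.

```python
import random


def equal_width_edges(values, n_parts):
    """Cut points that split [min, max] into n_parts equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_parts
    return [lo + i * width for i in range(1, n_parts)]


def equal_count_edges(values, n_parts):
    """Cut points at quantiles, so each interval holds about the same number of points."""
    ordered = sorted(values)
    return [ordered[i * len(ordered) // n_parts] for i in range(1, n_parts)]


def assign(points, edges, margin):
    """Map node index -> subset, splitting on the first coordinate.

    A point closer than `margin` (a hypothetical border width) to a cut
    is also copied to the adjacent subset, so neighbouring nodes overlap.
    This assumes every interval is wider than `margin`.
    """
    subsets = {i: [] for i in range(len(edges) + 1)}
    for p in points:
        x = p[0]
        home = sum(x >= e for e in edges)  # index of the interval containing x
        subsets[home].append(p)
        if home > 0 and x - edges[home - 1] < margin:
            subsets[home - 1].append(p)  # replicate across the left border
        if home < len(edges) and edges[home] - x < margin:
            subsets[home + 1].append(p)  # replicate across the right border
    return subsets


# Skewed synthetic data (not one of the paper's data sets).
random.seed(0)
points = [(random.expovariate(1.0), random.random()) for _ in range(1000)]
xs = [p[0] for p in points]

grid = assign(points, equal_width_edges(xs, 4), margin=0.05)
balanced = assign(points, equal_count_edges(xs, 4), margin=0.05)
print("equal-width sizes:", [len(grid[i]) for i in range(4)])
print("equal-count sizes:", [len(balanced[i]) for i in range(4)])
```

On skewed data the equal-width split puts most points on one node, while the quantile split keeps the subsets close to equal in size, apart from the replicated border points. Each node would then compute the LOF on its own subset.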
Keywords:
Anomaly detection, big data, distributed environment, local outlier factor, outlier detection
Competing interests
The authors have no competing interests.
Open Access Policy
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
ISSN (Online): 2040-7467
ISSN (Print): 2040-7459