K-Means Clustering and Its Real Use Case in the Security Domain

Mr Robot
5 min read · Sep 6, 2021

Clustering Analysis for Malware Behavior Detection in Cyber Crime

Cyber-attacks have become one of the biggest threats to computer and network systems around the world. It is therefore important to adopt an intrusion detection system (IDS) that can detect and analyze data with high accuracy (i.e., true positives and true negatives), low false detection (i.e., false positives and false negatives), and minimal detection time. A K-Means clustering detection model, drawing on data mining and on clustering methods in particular, is a notable approach that can be explored to address this problem. The need for continuous IDS improvement in terms of malware analysis accuracy, detection time, and a suitable detection approach is the motivation for this research.
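To make the accuracy and false-detection terms above concrete, here is a minimal Python sketch that computes them from the four confusion-matrix counts. The function name `detection_metrics` and the example counts are hypothetical, chosen only for illustration; they are not results from the research described here.

```python
def detection_metrics(tp, tn, fp, fn):
    """Return (accuracy, error_rate, false_positive_rate) from raw counts.

    tp/tn: correctly detected malicious / benign samples
    fp/fn: benign flagged as malicious / malicious that was missed
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total            # share of correct decisions
    error_rate = (fp + fn) / total          # share of false detections
    false_positive_rate = fp / (fp + tn)    # benign samples wrongly flagged
    return accuracy, error_rate, false_positive_rate


# Hypothetical counts for 1,000 monitored samples.
acc, err, fpr = detection_metrics(tp=90, tn=880, fp=20, fn=10)
print(f"accuracy={acc:.3f} error_rate={err:.3f} fpr={fpr:.3f}")
```

Accuracy and error rate always sum to 1, so an IDS is usually judged on the false positive rate as well, since a flood of false alarms can make an otherwise accurate detector impractical.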

Malware Detection

When malware enters a computer, it typically interferes with the registry: it tends to create and modify files in the file system and Windows registry entries, in addition to affecting interprocess communication and basic network interaction. Intrusion attacks such as malware are known to breach an organization's network security policy and continuously try to undermine the core principles of cybersecurity: Confidentiality, Integrity, and Availability, known as CIA. Previous cybersecurity researchers have therefore proposed detection frameworks for malware intrusion that monitor the behavior of system activity. The framework analyzes that behavior and notifies users if there is a sign of intrusion.

Detecting malware is crucial, since more than half of malware attacks exploit the computer registry, and such attacks can be detected by using an Intrusion Detection System (IDS) as an early defense (Ransomware attacks: detection, prevention and cure). An IDS is therefore one of the solutions for detecting intrusions and protecting networks and computer systems from cyber attacks.

Malware Detection by using K-Means Clustering

K-Means clustering is a method of cluster analysis in which a predefined number ‘k’ determines how many clusters the data is separated into, each with a center value (centroid) among its grouped objects. From a data mining perspective, the implemented K-Means clustering algorithm is used to separate normal from abnormal data, per time interval, within the same training dataset. In the database context, clustering refers to the ability of many servers or instances to connect to one database. In IDS, by contrast, clustering is usually used within anomaly detection to explore groups of malware data without prior knowledge of the relationships in the data. A clustering method groups objects according to the characteristics of their data points, so that every data point in a cluster is similar to the others in the same cluster but different from those in other clusters. Clustering is one of the most popular concepts in unsupervised learning, and anomaly detection is generally an unsupervised task. The idea is that similar data points tend to belong to the same group or cluster, as determined by their distance from the local centroids.

The graph shows that there are only two centroids, which are marked as ‘X’. The number of ‘X’ marks depends on the number of clusters defined in the first step of the process. The resulting cluster centroids are then used for fast anomaly detection when monitoring new incoming data [44]. The K-Means clustering algorithm, shown in Fig. 4, is one of the simplest unsupervised learning algorithms; it solves the clustering problem [8] by:

  1. Collecting a dataset of malware.
  2. Identifying the number of clusters (k).
  3. Initializing the k centroids (k means) for the data.
  4. Determining the distance of each malware sample from each centroid and assigning each sample to the cluster whose centroid is closest to it.
  5. Recomputing the centroid of each cluster.
  6. Repeating steps 4 and 5 until the cluster centroids no longer change.
  7. Repeating steps 1–6 with a different number of clusters if the formed clusters do not look reasonable.
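The steps above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the implementation used in the cited work: the function name `kmeans`, the random initialization from existing samples, the `max_iter` cap, and the toy two-feature data are all assumptions made for the example.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Step 3: initialise centroids by picking k distinct samples at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 4: Euclidean distance from every sample to every centroid,
        # then assign each sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Step 6: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Toy data (hypothetical): two well-separated groups of behaviour vectors.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centroids, labels = kmeans(X, k=2)
print(labels)
```

Step 7 corresponds to calling `kmeans` again with a different `k` and comparing the resulting clusters. In practice a library implementation such as scikit-learn's `KMeans` would normally be preferred over a hand-written loop.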

In the K-Means clustering method, the whole dataset is partitioned into Voronoi cells: the observations are divided into ‘k’ groups in which every observation belongs to the cluster with the nearest mean. In other words, it creates ‘k’ clusters of similar data points, and data instances that fall outside these groups can potentially be marked as anomalies. K-Means is widely used and can be regarded as the most popular clustering algorithm among the geometric procedures (Survey on anomaly detection using data mining techniques) because of its computational simplicity, efficiency, and ease of implementation. Because the algorithm is straightforward, its computation time is typically lower than that of more complex clustering algorithms, so the time spent on the malware clustering process can be minimized.
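One simple way to mark instances that "fall outside" the clusters is to measure each new sample's distance to its nearest centroid and flag it when that distance exceeds a threshold. The sketch below assumes this rule; the function name `flag_anomalies`, the threshold value, and the example centroids are hypothetical and would need tuning on real data.

```python
import numpy as np


def flag_anomalies(X, centroids, threshold):
    """Return a boolean mask that is True where a sample lies farther
    than `threshold` from its nearest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1) > threshold


# Hypothetical centroids learned earlier, and three new incoming samples.
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
new_points = np.array([[0.1, -0.1], [5.2, 4.9], [10.0, 0.0]])
mask = flag_anomalies(new_points, centroids, threshold=1.0)
print(mask)
```

Here only the third sample is flagged, because it is far from both centroids. Checking a new sample costs one distance computation per centroid, which is why stored centroids suit fast monitoring of incoming data.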

K-Means clustering combined with a Euclidean distance-based classifier correctly classified more than 14 million DNS transactions from 42,143 malware samples with respect to DNS C&C usage, and uncovered another bot family that uses DNS C&C. In addition, this method correctly detected DNS C&C in mixed office workstation network traffic. The approach thus provides a mechanism for detecting DNS C&C in network traffic.

The process is divided into four main phases, described as follows:

Phase 1: Binary Execution Phase. In this phase, the binary file is run in a virtual machine, namely the Drakvuf environment. All of its activities are then captured in log format.

Phase 2: File Extraction Phase. All the data, i.e., the malware activities, are extracted in this phase. Two types of data are extracted: first, the default file (normal activities), and second, the infected file (suspicious activities).

Phase 3: Registry Data Extraction Phase. All the collected registry data is then extracted and prepared in this phase, because the extracted data is imbalanced.
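The text does not say how the imbalanced data is balanced. One common option, shown here purely as an assumption, is random undersampling: samples are dropped from the larger class until every class has as many samples as the smallest one. The function name `undersample` and the toy labels (0 = normal, 1 = suspicious) are hypothetical.

```python
import numpy as np


def undersample(X, y, seed=0):
    """Randomly drop samples from larger classes so that every class
    ends up with as many samples as the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # For each class, keep n_min randomly chosen row indices.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]


# Toy data: 8 normal rows (label 0) and 2 suspicious rows (label 1).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = undersample(X, y)
print(np.bincount(y_bal))
```

Undersampling discards data, so with very few suspicious samples an oversampling method may be a better fit. The choice depends on the dataset.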

Phase 4: Clustering Phase. The last phase is the clustering phase, in which the balanced data is analyzed with the K-Means clustering algorithm to cluster each item as malware or not. The Euclidean distance formula is used to measure the distance between the centroid and the data points. The formula is shown in Fig.
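For reference, the standard Euclidean distance between a data point and a centroid can be written as follows. The symbols are notation assumed here rather than taken from the original figure: x is a data point, c is a centroid, and n is the number of features.

```latex
d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}
```

For example, with x = (1, 2) and c = (4, 6), the distance is the square root of 3² + 4², which is 5.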
