Malicious cyber adversaries may compromise the security of a system by denying access to legitimate users. Such attacks are often coupled with the loss of confidential data, which damages both the finances and the reputation of a corporation. Malware exploits key vulnerabilities in applications, causing problems such as identity theft and unapproved software installations. Despite the abundance of malware detection and removal techniques, current tools remain limited in their ability to detect malicious software. The techniques available today detect software that carries known signatures, and they are efficient for that purpose. However, most malware writers are aware of signature-based detection and are working towards bypassing it.
Machine-learning-based systems for malware classification and detection have been tested and shown to be more effective than standard signature-based systems. A key justification for using machine learning techniques is that they can detect even previously unseen malware, which reduces detection failures and yields high success rates.
Our method uses machine learning techniques to classify and detect portable executable (PE) files belonging to various malware classes commonly found on computers running Windows operating systems. For malicious files, the distance between two files should indicate how similar they are. On this basis, this thesis analyses the approaches that can be employed to classify malicious files using a measure known as rank distance. This distance measure is combined with a feature selection method based on mutual information, which analyses the opcode n-gram sequences extracted from the PE files and selects the most relevant of them. The most relevant opcode n-grams thus obtained are used as features to identify the class to which a given file belongs. For every class, an opcode relevance profile generated from mutual information is compared with the unclassified file, and ranks are assigned to the features of each. Using these ranks, a distance between the file and each class profile is obtained. The class with the smallest distance to the file is taken to be the class of the file under scrutiny.
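The pipeline described above can be illustrated with a minimal sketch. It is not the implementation used in this thesis, and it rests on several assumptions: mutual information is computed between the presence of an n-gram in a file and the class label; each class profile ranks the selected n-grams by their pooled frequency within that class; rank distance is taken in one common formulation, where the top item of a ranking gets the highest score and an item missing from a ranking scores 0; ties are broken lexicographically. All function names and the toy opcode traces are hypothetical.

```python
import math
from collections import Counter

def ngrams(opcodes, n=2):
    """Return the opcode n-grams of a sequence as tuples."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]

def mutual_information(samples, gram, n=2):
    """MI (in bits) between presence of `gram` in a file and the file's class."""
    total = len(samples)
    joint = Counter((gram in ngrams(ops, n), label) for ops, label in samples)
    p_x, p_y = Counter(), Counter()
    for (x, y), c in joint.items():
        p_x[x] += c
        p_y[y] += c
    return sum((c / total) * math.log2(c * total / (p_x[x] * p_y[y]))
               for (x, y), c in joint.items())

def select_features(samples, k, n=2):
    """Keep the k n-grams with the highest mutual information."""
    candidates = {g for ops, _ in samples for g in ngrams(ops, n)}
    scored = sorted(candidates,
                    key=lambda g: (-mutual_information(samples, g, n), g))
    return set(scored[:k])

def ranking(grams, features):
    """Rank selected n-grams by descending frequency (assumed profile shape).

    The most frequent n-gram receives the highest score.
    """
    counts = Counter(g for g in grams if g in features)
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    return {g: len(ordered) - i for i, g in enumerate(ordered)}

def rank_distance(r1, r2):
    """Sum of absolute score differences; a missing n-gram scores 0."""
    return sum(abs(r1.get(g, 0) - r2.get(g, 0)) for g in set(r1) | set(r2))

def build_profiles(samples, features, n=2):
    """One ranking per class, pooled over that class's training files."""
    pooled = {}
    for ops, label in samples:
        pooled.setdefault(label, []).extend(ngrams(ops, n))
    return {label: ranking(grams, features) for label, grams in pooled.items()}

def classify(opcodes, profiles, features, n=2):
    """Return the label of the nearest class profile and all distances."""
    file_rank = ranking(ngrams(opcodes, n), features)
    distances = {label: rank_distance(file_rank, profile)
                 for label, profile in profiles.items()}
    return min(distances, key=distances.get), distances

# Toy, made-up opcode traces; real input would come from disassembled PE files.
samples = [
    (["push", "mov", "call", "push", "mov", "call"], "trojan"),
    (["push", "mov", "call", "ret"], "trojan"),
    (["xor", "jmp", "xor", "jmp", "xor"], "worm"),
    (["xor", "jmp", "ret"], "worm"),
]
features = select_features(samples, k=3)
profiles = build_profiles(samples, features)
label, distances = classify(["push", "mov", "call", "push", "mov"],
                            profiles, features)
print(label, distances)
```

On this toy data the unclassified trace lies nearest to the trojan profile (distance 2, against 4 for the worm profile), so it is assigned to that class. Restricting both rankings to the selected n-grams keeps the distance from being dominated by opcodes that carry little information about the class.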