Modeling Malware as a Language
Malware (Computer software), Programming languages (Electronic computers)--Semantics, Task analysis, Computer viruses, Natural language processing (Computer science), Natural language generation (Computer science), Pattern recognition systems
Computer Engineering | Computer Sciences | Electrical and Computer Engineering
Malware detection and malware construction are evolving in parallel. As malware authors incorporate evasive techniques into malware construction, antivirus software developers incorporate new static and dynamic analysis techniques into malware detection and classification, with the aim of thwarting such evasive techniques. In this paper, we propose a new approach to static malware analysis that treats malware analysis as natural language analysis. We propose modeling malware as a language and assess the feasibility of finding semantics in instances of that language. We concretize this abstract problem into a classification task. Given a large dataset of malware instances categorized into nine classes, we isolate strong semantic similarities between malware instances of the same class and classify unknown instances by their strength of similarity to a class. Our approach consists of a proposed method for defining a malware language, in which malware instances are documents written in that language. We use the word2vec model to generate a computational representation of such documents and choose a document distance as the measure of semantic closeness between them. We classify malware documents by applying the k-nearest neighbors (kNN) algorithm. Validating our model using leave-one-out cross-validation, we record a classification accuracy of up to 98%. We conclude that we can find, and ultimately manipulate, semantics in malware.
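The pipeline described in the abstract (embed documents, measure a document distance, classify with kNN, validate with leave-one-out) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the opcode-like tokens, the hand-set two-dimensional `EMBEDDINGS` table that stands in for trained word2vec vectors, the class labels, and the choice of cosine distance between averaged token vectors as the document distance. The abstract does not say which document distance the authors selected.

```python
import math
from collections import Counter

# Hypothetical 2-D token embeddings standing in for trained word2vec vectors.
EMBEDDINGS = {
    "mov": (1.0, 0.1), "push": (0.9, 0.2), "call": (0.8, 0.3),
    "xor": (0.1, 1.0), "jmp": (0.2, 0.9), "loop": (0.3, 0.8),
}

def doc_vector(tokens):
    """Represent a document as the average of its known token embeddings."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        raise ValueError("document contains no known tokens")
    dim = len(vecs[0])
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(dim))

def cosine_distance(a, b):
    """Assumed document distance: 1 minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def knn_predict(query, train, k=3):
    """Majority vote among the k training vectors closest to the query.

    `train` is a list of (vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: cosine_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(docs, labels, k=3):
    """Hold out each document in turn and classify it from all the others."""
    vectors = [doc_vector(d) for d in docs]
    correct = 0
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        train = [(v, l) for j, (v, l) in enumerate(zip(vectors, labels)) if j != i]
        if knn_predict(vec, train, k) == label:
            correct += 1
    return correct / len(docs)

# Toy "malware documents": token sequences with made-up class labels.
DOCS = [
    ["mov", "push", "call"], ["push", "call", "mov", "mov"], ["call", "mov"],
    ["xor", "jmp", "loop"], ["jmp", "loop", "xor", "xor"], ["loop", "xor"],
]
LABELS = ["class_a", "class_a", "class_a", "class_b", "class_b", "class_b"]

if __name__ == "__main__":
    print(leave_one_out_accuracy(DOCS, LABELS, k=3))
```

In a realistic setting the embedding table would be learned from the malware corpus itself, and the averaged-vector distance could be swapped for another document distance without changing the kNN or validation code.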
Awad, Yara; Nassar, Mohamed; and Safa, Haidar, "Modeling Malware as a Language" (2018). Electrical & Computer Engineering and Computer Science Faculty Publications. 113.
Y. Awad, M. Nassar and H. Safa, "Modeling Malware as a Language," 2018 IEEE International Conference on Communications (ICC), 2018, pp. 1-6, doi: 10.1109/ICC.2018.8422083.