Modeling Malware as a Language
Document Type
Article
Publication Date
2018
Subject: LCSH
Malware (Computer software), Programming languages (Electronic computers)--Semantics, Task analysis, Computer viruses, Natural language processing (Computer science), Natural language generation (Computer science), Pattern recognition systems
Disciplines
Computer Engineering | Computer Sciences | Electrical and Computer Engineering
Abstract
Malware detection and malware construction are evolving in parallel. As malware authors incorporate evasive techniques into malware construction, antivirus software developers incorporate new static and dynamic analysis techniques into malware detection and classification with the aim of thwarting such evasive techniques. In this paper, we propose a new approach to static malware analysis that treats malware analysis as natural language analysis. We propose modeling malware as a language and assess the feasibility of finding semantics in instances of that language. We concretize this abstract problem into a classification task. Given a large dataset of malware instances categorized into 9 classes, we isolate strong semantic similarities between malware instances of the same class and classify unknown instances by strength of similarity to a class. Our approach consists of a proposed method for defining a malware-language, where malware instances are documents written in that language. We use the word2vec model to generate a computational representation of such documents and choose a document-distance as the measure of semantic closeness between them. We classify malware-documents by applying the k-nearest neighbors (kNN) algorithm. Validating our model using leave-one-out cross-validation, we record a classification accuracy of up to 98%. We conclude that we can find, and ultimately manipulate, semantics in malware.
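The pipeline described in the abstract (documents, vector representation, document distance, kNN, leave-one-out validation) can be illustrated with a minimal sketch. This is not the authors' implementation. The two-dimensional vectors in TOY_VECTORS are hypothetical stand-ins for trained word2vec embeddings, averaging word vectors and cosine distance are assumed substitutes for the paper's unspecified document representation and document-distance, and the tokens, documents, and class labels are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical 2-D "word vectors" standing in for trained word2vec embeddings.
TOY_VECTORS = {
    "mov": (1.0, 0.1),
    "push": (0.9, 0.2),
    "pop": (0.8, 0.3),
    "xor": (0.1, 1.0),
    "jmp": (0.2, 0.9),
    "call": (0.3, 0.8),
}


def embed(document):
    """Represent a document as the average of its known word vectors (an assumption)."""
    vectors = [TOY_VECTORS[token] for token in document if token in TOY_VECTORS]
    if not vectors:
        raise ValueError("document contains no known tokens")
    dims = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dims))


def cosine_distance(a, b):
    """Cosine distance, used here as an assumed stand-in for the document-distance."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def knn_predict(query, neighbors, k):
    """Majority vote among the k labeled vectors closest to the query."""
    ranked = sorted(neighbors, key=lambda item: cosine_distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(documents, labels, k=1):
    """Classify each document using all the others and return the fraction correct."""
    embedded = [embed(doc) for doc in documents]
    correct = 0
    for i, (vector, label) in enumerate(zip(embedded, labels)):
        rest = [(v, l) for j, (v, l) in enumerate(zip(embedded, labels)) if j != i]
        if knn_predict(vector, rest, k) == label:
            correct += 1
    return correct / len(documents)


# Invented toy "malware-documents": token sequences with made-up class labels.
DOCUMENTS = [
    ["mov", "push", "pop"],
    ["push", "mov", "mov"],
    ["pop", "push", "mov", "push"],
    ["xor", "jmp", "call"],
    ["jmp", "xor", "xor"],
    ["call", "jmp", "xor", "jmp"],
]
LABELS = ["class_a", "class_a", "class_a", "class_b", "class_b", "class_b"]

if __name__ == "__main__":
    print(leave_one_out_accuracy(DOCUMENTS, LABELS, k=1))
```

On this toy data the two classes separate cleanly, so the sketch only demonstrates the mechanics of the evaluation loop and says nothing about the accuracy reported in the paper.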
DOI
10.1109/ICC.2018.8422083
Repository Citation
Awad, Yara; Nassar, Mohamed; and Safa, Haidar, "Modeling Malware as a Language" (2018). Electrical & Computer Engineering and Computer Science Faculty Publications. 113.
https://digitalcommons.newhaven.edu/electricalcomputerengineering-facpubs/113
Publisher Citation
Y. Awad, M. Nassar and H. Safa, "Modeling Malware as a Language," 2018 IEEE International Conference on Communications (ICC), 2018, pp. 1-6, doi: 10.1109/ICC.2018.8422083.
Comments
Article part of the 2018 IEEE International Conference on Communications (ICC), Kansas City, MO, USA, 20-24 May 2018.
Members of the University of New Haven community can access the full text through our IEEE Xplore electronic database.