Modeling Malware as a Language

Subject: LCSH

Malware (Computer software), Programming languages (Electronic computers)--Semantics, Task analysis, Computer viruses, Natural language processing (Computer science), Natural language generation (Computer science), Pattern recognition systems


Computer Engineering | Computer Sciences | Electrical and Computer Engineering


Malware detection and malware construction are evolving in parallel. As malware authors incorporate evasive techniques into malware construction, antivirus software developers incorporate new static and dynamic analysis techniques into malware detection and classification to thwart them. In this paper, we propose a new approach to static malware analysis that treats malware analysis as natural language analysis. We propose modeling malware as a language and assess the feasibility of finding semantics in instances of that language. We concretize this abstract problem as a classification task. Given a large dataset of malware instances categorized into 9 classes, we isolate strong semantic similarities between malware instances of the same class and classify unknown instances by strength of similarity to a class. Our approach consists of a proposed method for defining a malware-language, where malware instances are documents written in that language. We use the word2vec model to generate a computational representation of such documents and choose a document-distance as the measure of semantic closeness between them. We classify malware-documents by applying the k-nearest neighbors algorithm (kNN). Validating our model using leave-one-out cross-validation, we record a classification accuracy of up to 98%. We conclude that we can find, and ultimately manipulate, semantics in malware.
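The classification pipeline described in the abstract (embed documents, measure a document-distance, classify with kNN, validate with leave-one-out) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the opcode-like tokens, the hand-written two-dimensional `EMBEDDINGS` table (a stand-in for vectors a trained word2vec model would produce), the choice of averaged token vectors with cosine distance as the document-distance, and the two toy classes `A` and `B`.

```python
import math
from collections import Counter

# Hypothetical toy embeddings: a stand-in for vectors learned by word2vec.
EMBEDDINGS = {
    "mov": (1.0, 0.1),
    "push": (0.9, 0.2),
    "pop": (0.8, 0.3),
    "xor": (0.1, 1.0),
    "jmp": (0.2, 0.9),
    "call": (0.3, 0.8),
}


def doc_vector(tokens):
    """Represent a document as the average of its known token embeddings."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        raise ValueError("document has no known tokens")
    dim = len(vecs[0])
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(dim))


def cosine_distance(a, b):
    """Assumed document-distance: 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(query, labeled, k):
    """Majority vote among the k labeled vectors closest to the query."""
    nearest = sorted(labeled, key=lambda item: cosine_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(docs, labels, k=1):
    """Hold out each document in turn and classify it against all the others."""
    vectors = [doc_vector(d) for d in docs]
    correct = 0
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        rest = [(v, l) for j, (v, l) in enumerate(zip(vectors, labels)) if j != i]
        if knn_predict(vec, rest, k) == label:
            correct += 1
    return correct / len(docs)


# Toy corpus: each "document" is a token sequence with a class label.
DOCS = [
    ["mov", "push", "pop"],
    ["push", "mov", "mov"],
    ["pop", "push", "mov", "push"],
    ["xor", "jmp", "call"],
    ["jmp", "xor", "xor"],
    ["call", "jmp", "xor", "jmp"],
]
LABELS = ["A", "A", "A", "B", "B", "B"]

if __name__ == "__main__":
    print(leave_one_out_accuracy(DOCS, LABELS, k=1))
```

Averaging token vectors is only one simple way to compare documents; other document-distances can be substituted by replacing `cosine_distance` and `doc_vector` without changing the kNN or leave-one-out logic.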


Article part of the 2018 IEEE International Conference on Communications (ICC), Kansas City, MO, USA, 20-24 May 2018.

University of New Haven Community can access the full-text through our IEEE Xplore Electronic database.



Publisher Citation

Y. Awad, M. Nassar and H. Safa, "Modeling Malware as a Language," 2018 IEEE International Conference on Communications (ICC), 2018, pp. 1-6, doi: 10.1109/ICC.2018.8422083.
