Neural Nets Can Learn Function Type Signatures from Binaries [pdf]

bmc7505 | 213 points

TLDR:

> In this paper, we present a new system called EKLAVYA which trains a recurrent neural network to recover function type signatures from disassembled binary code. EKLAVYA assumes no knowledge of the target instruction set semantics to make such inference.

> [...] we find by analyzing its model that it auto-learns relationships between instructions, compiler conventions, stack frame setup instructions, use-before-write patterns, and operations relevant to identifying types directly from binaries.

> In our evaluation on Linux binaries compiled with clang and gcc, for two different architectures (x86 and x64), EKLAVYA exhibits accuracy of around 84% and 81% for function argument count and type recovery tasks respectively.
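To make the quoted setup concrete: the core idea is feeding a sequence of instruction tokens to a recurrent net and classifying its final state into an argument count. Here's a minimal, untrained vanilla-RNN sketch in numpy — the token vocabulary, dimensions, and weights are all illustrative assumptions, not EKLAVYA's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instruction tokens (real systems tokenize disassembly).
VOCAB = {"push_ebp": 0, "mov_ebp_esp": 1, "sub_esp_imm": 2,
         "mov_eax_mem": 3, "call": 4, "ret": 5}
EMBED, HIDDEN, CLASSES = 8, 16, 9  # predict 0..8 arguments

# Random (untrained) weights, just to show the data flow.
W_embed = rng.normal(size=(len(VOCAB), EMBED))
W_xh = rng.normal(size=(EMBED, HIDDEN)) * 0.1
W_hh = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1
W_hy = rng.normal(size=(HIDDEN, CLASSES)) * 0.1

def predict_arg_count(tokens):
    """Run a vanilla RNN over an instruction-token sequence and
    return a softmax distribution over argument counts."""
    h = np.zeros(HIDDEN)
    for t in tokens:
        x = W_embed[VOCAB[t]]          # embed the instruction token
        h = np.tanh(x @ W_xh + h @ W_hh)  # recurrent update
    logits = h @ W_hy                  # classify the final state
    e = np.exp(logits - logits.max())
    return e / e.sum()

probs = predict_arg_count(["push_ebp", "mov_ebp_esp", "sub_esp_imm", "ret"])
```

With trained weights (the paper trains on binaries labeled via their source), `probs.argmax()` would be the predicted argument count.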

dmix | 7 years ago

I've long had this idea to build a decompiler (a program that maps binaries back to source code) using machine learning. The problem in decompilation is that you lose information when you compile source code. Machine learning could help recover even things like the most likely variable names. There are also tons of training data that can be easily generated.
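The "easily generated training data" point is that labels come for free from the source you compile. A toy sketch of label extraction — the regex and helper are hypothetical (a real pipeline would compile with `-g` and read DWARF debug info rather than parse C with a regex):

```python
import re

# Naive pattern for C function definitions: return type, name, params, "{".
# Purely illustrative; it will miss many real-world declarations.
PROTO = re.compile(r"^\s*\w[\w\s\*]*?\b(\w+)\s*\(([^)]*)\)\s*\{", re.M)

def extract_labels(c_source):
    """Map each function name to its argument count, as ground truth
    for training on the compiled binary of the same source."""
    labels = {}
    for name, params in PROTO.findall(c_source):
        params = params.strip()
        labels[name] = 0 if params in ("", "void") else params.count(",") + 1
    return labels

src = """
int add(int a, int b) { return a + b; }
void hello(void) { }
"""
print(extract_labels(src))  # {'add': 2, 'hello': 0}
```

Pair these labels with the disassembly of the compiled functions and you have supervised examples at whatever scale you can compile.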

ma2rten | 7 years ago

I've often wondered about the theoretical limit of a neural net to learn from examples - seems like a fascinating subject with a lot of implications. From a quick search, I found this paper: https://experts.illinois.edu/en/publications/computational-l... which is already very interesting. Are there any other good pointers on this?
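One classical entry point to that question is PAC learning, which bounds how many examples are needed rather than what a given architecture can express. For a finite hypothesis class $\mathcal{H}$, a sample size of roughly

```latex
m \;\ge\; \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)
```

suffices to guarantee error at most $\epsilon$ with probability at least $1-\delta$; for infinite classes like neural nets, $\ln|\mathcal{H}|$ is replaced by capacity measures such as the VC dimension.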

placebo | 7 years ago

COTS = "commercial off-the-shelf", in case anyone is equally mystified. Google wasn't immediately helpful.

incompatible | 7 years ago

The authors have put out the datasets: https://github.com/shensq04/EKLAVYA

therandomoracle | 7 years ago

This would only be useful for binaries where you don't have corresponding source, but the binary does still have symbols. Is this a common situation?

haberman | 7 years ago