A beginner’s guide to static program analysis using Soot
In this blog post, I will show you an example that uses Soot to gain some insight into a Java program. This post is designed for people who know Java programming and want to do some static analysis in practice, but who know nothing about Soot or the theory of static analysis. The repository that contains the example can be found at https://github.com/noidsirius/SootTutorial.
The Soot Tutorial Series
If you already have some knowledge of static program analysis, I suggest you learn Soot from here. I would appreciate any feedback, since I would like to continue writing about Soot and static analysis in the future.
Why another tutorial for Soot? (or how did I motivate myself to write this post)
A while ago, I wanted to analyze an Android app and instrument it for one of my course projects. I was told Soot is one of the greatest frameworks for analyzing Java and Android programs, for both researchers and practitioners, so I decided to learn Soot as fast as I could. There were several resources for learning Soot; in particular, Soot’s wiki page listed some useful and easy tutorials.
According to the authors of A Survivor’s Guide to Java Program Analysis with Soot, “Soot is a large framework which can be quite challenging to navigate without getting quickly lost”, which suggests Soot has a steep learning curve. They were right; I got lost very quickly. Most of Soot’s guides assume the reader is familiar with the theoretical side of static program analysis, such as lattices or flow functions, which I had no prior knowledge of. Moreover, in my opinion these guides try to explain everything in Soot, including parts that most Soot users will never need.
One way or another I learned Soot, or at least some basic parts of it, but I realized those tutorials are not suitable for people with no background in static analysis. As a result, I decided to write this blog post to introduce Soot and static analysis using very simple (but working) examples. I relied heavily on the Soot wiki and A Survivor’s Guide to Java Program Analysis with Soot to write this post.
Analyze FizzBuzz Statically
Static program analysis, in its simplest form, is a black box that takes a program (code) as input and outputs some properties of that program. For example, let’s say we are interested in finding all the branch statements in a method, and call this analysis BranchDetectorAnalysis. To illustrate this example, I am going to use a trivial program: FizzBuzz. FizzBuzz prints each number from 1 to 100, but if the number is divisible by 3 / 5 / 15 it prints Fizz / Buzz / FizzBuzz instead of the number. Here is a Java class that implements FizzBuzz.
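A minimal sketch of such a class, assuming an instance method printFizzBuzz(int) and laid out so that the three if conditions fall on lines 4, 6, and 8. The fizzBuzz driver and the main method are hypothetical additions for illustration only:

```java
class FizzBuzz {
    // Prints Fizz, Buzz, or FizzBuzz for k, or k itself when none applies.
    public void printFizzBuzz(int k) {
        if (k % 15 == 0)
            System.out.println("FizzBuzz");
        else if (k % 5 == 0)
            System.out.println("Buzz");
        else if (k % 3 == 0)
            System.out.println("Fizz");
        else
            System.out.println(k);
    }

    // Hypothetical driver: applies printFizzBuzz to every number in 1..n.
    public void fizzBuzz(int n) {
        for (int i = 1; i <= n; i++)
            printFizzBuzz(i);
    }

    public static void main(String[] args) {
        new FizzBuzz().fizzBuzz(100);
    }
}
```

The check for 15 has to come first; otherwise a number such as 30 would already be caught by the check for 5 or 3.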
So the input of BranchDetectorAnalysis is a Java method (printFizzBuzz) and the output is the set of statements that branch the execution of the code (lines 4, 6, and 8). Note that line 10 is not considered a branch statement, since its condition is implicitly determined by line 8.
Alright! Let’s actually run this dummy analysis with Soot. To get started, clone the SootTutorial repository onto your machine.
git clone git@github.com:noidsirius/SootTutorial.git
This repository contains the code that we will use throughout this post. You can open it with IntelliJ IDEA or just use gradle in a command-line terminal. Please make sure that your Java version is not higher than 8, since the current version of Soot does not yet support JPMS, the Java Platform Module System (you can use jEnv to manage different Java versions).
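If you are unsure which Java version your terminal picks up, one way to check it from Java itself is to read the standard java.specification.version system property, which is "1.8" on Java 8 and a plain number such as "11" on later releases. The sketch below is only an illustration; the class and method names are hypothetical and are not part of the SootTutorial repository:

```java
class JavaVersionCheck {
    // Extracts the major version from a java.specification.version value:
    // "1.8" -> 8 (legacy numbering), "11" -> 11 (numbering used since Java 9).
    static int majorVersion(String specVersion) {
        String[] parts = specVersion.split("\\.");
        if (parts[0].equals("1") && parts.length > 1) {
            return Integer.parseInt(parts[1]);
        }
        return Integer.parseInt(parts[0]);
    }

    // True when the given specification version is Java 8 or older.
    static boolean isAtMostJava8(String specVersion) {
        return majorVersion(specVersion) <= 8;
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        String verdict = isAtMostJava8(spec) ? "is fine" : "is too new for this tutorial";
        System.out.println("Java " + spec + " " + verdict);
    }
}
```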
In order to verify the project is set up correctly, run the tests:
cd SootTutorial
./gradlew check
If everything goes well, you can run the analysis with:
./gradlew run --args="HelloSoot"
The output should be the signature of printFizzBuzz, the variables that hold its argument and this, the body of printFizzBuzz in Jimple, and finally the branch statements in the method. Next we will review how Soot produces this information and what Jimple is. Please note that the main analysis method can be found in dev.navids.soottutorial.HelloSoot.java.
Setup Soot
As I mentioned earlier, Soot is a complex piece of software with lots of configurable settings. I won’t go through the details of the setup, except for the most important part, which is setting the Soot classpath. Soot considers all Java classes in this classpath as its input. In this example, the classpath is demo/HelloSoot, which contains FizzBuzz.class. For more information regarding this part, you can check this link out.
Method body retrieval
In order to run BranchDetectorAnalysis on printFizzBuzz, we have to retrieve its body, but first we have to locate the method. Soot has some data structures to represent the classes, methods, and statements of the input program.
Scene is a singleton class that keeps all classes, each of which is represented by a SootClass. Each SootClass may contain several methods (SootMethod) and each method may have a Body object that keeps the statements (Units). So, after setting up Soot, we can access these objects via the Soot API. The code snippet below gets the SootClass of FizzBuzz, finds the printFizzBuzz method, and finally retrieves its JimpleBody, which contains the statements of the method.
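// Look up the class by name in the Scene (Scene, SootClass, SootMethod live in package soot; JimpleBody in soot.jimple).
SootClass mainClass = Scene.v().getSootClass("FizzBuzz");
// Find the method by name; this only works when the name is unique within the class.
SootMethod sm = mainClass.getMethodByName("printFizzBuzz");
// Retrieve the method body; with Soot's default IR it is a JimpleBody.
JimpleBody body = (JimpleBody) sm.retrieveActiveBody();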
But what is Jimple?
Soot provides several intermediate representations (IRs) of Java programs in order to make static analysis more convenient. The default IR in Soot is Jimple (Java Simple), which sits somewhere between Java and Java bytecode. Java source is preferable for humans since they can read it easily, while Java bytecode is suitable for machines. Jimple is a statement-based, typed (every variable has a Type), 3-address (every statement has at most 3 variables) intermediate representation. The code below is the representation of the printFizzBuzz method in Jimple.
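To get a feel for what 3-address means, the sketch below hand-writes the FizzBuzz decision twice in ordinary Java: once with compound conditions, and once with every intermediate value stored in its own typed temporary, one operation per statement. This is only an analogy and not Jimple or Soot output: the names ThreeAddressDemo, compound, flattened, and the t-variables are hypothetical, and real Jimple also replaces structured if/else with jumps to labels.

```java
class ThreeAddressDemo {
    // Source-style Java: each condition nests two operations (% and ==).
    static String compound(int k) {
        if (k % 15 == 0) return "FizzBuzz";
        if (k % 5 == 0) return "Buzz";
        if (k % 3 == 0) return "Fizz";
        return String.valueOf(k);
    }

    // Three-address style: every statement performs at most one operation
    // and every intermediate result lives in an explicitly typed temporary.
    static String flattened(int i0) {
        int t1 = i0 % 15;
        if (t1 == 0) return "FizzBuzz";
        int t2 = i0 % 5;
        if (t2 == 0) return "Buzz";
        int t3 = i0 % 3;
        if (t3 == 0) return "Fizz";
        String t4 = String.valueOf(i0);
        return t4;
    }

    public static void main(String[] args) {
        // Both versions should agree for every input.
        for (int k = 1; k <= 15; k++) {
            System.out.println(compound(k) + " / " + flattened(k));
        }
    }
}
```

The flattened form is longer, but each statement is trivial to inspect, which is what makes this kind of representation convenient for analysis.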
There is nothing implicit in Jimple. For example, this is represented as r0, which is a Local object (the data structure for variables in Soot), and the argument of the function is explicitly defined in i0, whose type is int. Each line represents a Unit (or Stmt, since the default IR is Jimple). There are 15 different types of Stmts in Jimple, but in BranchDetectorAnalysis we are interested in only one of them: JIfStmt. Here is the code that prints branch statements:
for (Unit u : body.getUnits()) {
    if (u instanceof JIfStmt)
        System.out.println(u.toString());
}
body.getUnits() returns the list (or, more precisely, the Chain) of units in the body of printFizzBuzz. We simply iterate over these units and print every one that is an instance of JIfStmt; these are lines 4, 9, and 14.
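If the containment chain (Scene, class, method, body, units) and the instanceof filter feel abstract, the following toy model may help. It uses plain-Java stand-ins rather than Soot's classes: every Toy-prefixed name is hypothetical, and the statement texts are hand-written placeholders, not what Soot prints.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ToyModel {
    // Stand-in for a statement (the role Unit plays in Soot).
    static class ToyUnit {
        final String text;
        ToyUnit(String text) { this.text = text; }
        @Override public String toString() { return text; }
    }

    // Stand-in for a conditional branch (the role JIfStmt plays).
    static class ToyIf extends ToyUnit {
        ToyIf(String text) { super(text); }
    }

    // Stand-in for a method: its body is an ordered list of units.
    static class ToyMethod {
        final List<ToyUnit> units = new ArrayList<>();
    }

    // Stand-in for a class: methods are looked up by name.
    static class ToyClass {
        final Map<String, ToyMethod> methods = new LinkedHashMap<>();
    }

    // Stand-in for the Scene: a registry of classes keyed by name.
    static Map<String, ToyClass> buildScene() {
        ToyMethod m = new ToyMethod();
        m.units.add(new ToyIf("if k % 15 != 0 goto A"));
        m.units.add(new ToyUnit("print FizzBuzz; goto END"));
        m.units.add(new ToyIf("A: if k % 5 != 0 goto B"));
        m.units.add(new ToyUnit("print Buzz; goto END"));
        m.units.add(new ToyIf("B: if k % 3 != 0 goto C"));
        m.units.add(new ToyUnit("print Fizz; goto END"));
        m.units.add(new ToyUnit("C: print k"));
        m.units.add(new ToyUnit("END: return"));
        ToyClass c = new ToyClass();
        c.methods.put("printFizzBuzz", m);
        Map<String, ToyClass> scene = new LinkedHashMap<>();
        scene.put("FizzBuzz", c);
        return scene;
    }

    // Same shape as the loop in the post: keep only the units that are branches.
    static List<ToyUnit> findBranches(ToyMethod method) {
        List<ToyUnit> branches = new ArrayList<>();
        for (ToyUnit u : method.units) {
            if (u instanceof ToyIf) branches.add(u);
        }
        return branches;
    }

    public static void main(String[] args) {
        ToyMethod m = buildScene().get("FizzBuzz").methods.get("printFizzBuzz");
        for (ToyUnit u : findBranches(m)) System.out.println(u);
    }
}
```

The point of the model is that branch detection needs no knowledge of what a statement computes, only of its kind, which is why a type check is enough.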
Control-Flow Graph
The branch statements control the flow of execution. All possible paths that may be executed in a method are represented as a Control-Flow Graph (CFG). Soot is capable of creating the CFG of a method through a class called UnitGraph. The image below visualizes the CFG of the printFizzBuzz method. You can draw this image by running
./gradlew run --args="HelloSoot draw"
Here you can see that there are four possible paths from the start of the method to its end, and the three branch statements are colored in blue. These paths correspond to numbers divisible by 3, by 5, by 15, or by none of them.
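The count of four paths can be checked with a small, Soot-free sketch. It builds a hand-written successor map that approximates the shape of printFizzBuzz (an assumption: the real CFG has more nodes, one per Jimple statement) and counts the paths from the first node to the return node. The class name CfgPaths and the node labels are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CfgPaths {
    // Hand-built CFG: each node maps to the nodes that may execute next.
    static Map<String, List<String>> buildCfg() {
        Map<String, List<String>> succ = new LinkedHashMap<>();
        succ.put("if k%15==0", Arrays.asList("print FizzBuzz", "if k%5==0"));
        succ.put("if k%5==0", Arrays.asList("print Buzz", "if k%3==0"));
        succ.put("if k%3==0", Arrays.asList("print Fizz", "print k"));
        succ.put("print FizzBuzz", Arrays.asList("return"));
        succ.put("print Buzz", Arrays.asList("return"));
        succ.put("print Fizz", Arrays.asList("return"));
        succ.put("print k", Arrays.asList("return"));
        succ.put("return", Collections.<String>emptyList());
        return succ;
    }

    // Counts distinct paths from node to exit; terminates because this graph has no cycles.
    static int countPaths(Map<String, List<String>> succ, String node, String exit) {
        if (node.equals(exit)) return 1;
        int total = 0;
        for (String next : succ.get(node)) {
            total += countPaths(succ, next, exit);
        }
        return total;
    }

    // A branch node is one with more than one successor.
    static int countBranches(Map<String, List<String>> succ) {
        int branches = 0;
        for (List<String> next : succ.values()) {
            if (next.size() > 1) branches++;
        }
        return branches;
    }

    public static void main(String[] args) {
        Map<String, List<String>> cfg = buildCfg();
        System.out.println("paths: " + countPaths(cfg, "if k%15==0", "return"));
        System.out.println("branches: " + countBranches(cfg));
    }
}
```

Each branch node adds exactly one extra way to reach the end, so three branches in a chain give four paths. The recursive count would not terminate on a method with a loop, where the CFG contains a cycle.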
Conclusion
In this post, I tried to show you how to use Soot to get some insight into a Java method. My primary goal was to show a working example (and provide its code and environment) that gives a sense of the basic building blocks of Soot without requiring knowledge of Soot’s complex configuration. I hope to write another blog post that does a real static analysis with Soot.