Taint Data Flow Analysis Implementation

This pages serves as notes documenting the taint data flow analysis and how it's implemented in Find Security Bugs.

Intro

General Info

The whole point of taint data flow analysis is to find a call path in a program that connects attacker-supplied data and sensitive places. Identify when an user-supplied dirty/tainted data hits vulnerable sink, in other words: when the shit hits the fan.

Here we don't discuss what is taint data flow analysis, what is a taint or what is a sink. Just in short - the analysis is trying to connect user input (taint) with dangerous application methods (sinks). When a tainted data hit a sink it means a problem - user input hits sensitive application method. So an attacker can trigger a vulnerability.

Find Security Bugs' 2-phase taint data flow approach

The taint data flow approach in Find Security Bugs has 2 phases.

Analyze the data flow inside an analyzed method and assign a "label" to each data flowing through the method. The "label" is Taint class with context information. If it's based on user input then assign a "dirty" flag - State.TAINTED, if it's safe to use (the data is a string constant) then assign State.SAFE, if it can't reason then assign State.UNKNOWN and note the source.
Analyze sinks inside the analyzed method. Once a method is analyzed with TaintDataflow instance the implementation brings manually curated list of sinks (every detector has different ones), iterate over calls in the method and analyze what/if a tainted data flows into the sinks in the method.

Taint data flow execution

Outside FindSecurityBugs

FindBugs (SpotBugs) gathers all analyzed class files ("application" + "referenced" from application classes) and builds an execution plan.

It first run FirstPassDetectors on the referenced classes (Find Security Bugs have no such detectors).
Then runs the rest of Detectors (all FindSecBugs Detectors) on the application classes.

Each Find Security Bugs Detector is called on each app class file. The outer loop goes over files (method call-order sorted) and inner loop over detectors. This is how a class got analyzed by each Detector.

Inside FindSecurityBugs

The flow in each of the special taint data flow detectors (not all detectors in Find Security Bugs are taint data flow based):

The detector grabs the analyzed class and lists all methods in a reverse call order
Then analyze each method in 2 logical phases:
- It gets (runs) TaintDataflow analysis of this method (phase 1)
- Run phase 2 by going over the method bytecode instructions to find "invoke"s. If it's entering one of "sinks" then analyze the arguments' taint reaching the sink and report bug (or keep it for later use).

Why is reverse call order so important? Because this way it can first analyze the tree leaves - methods that don't depend on others. Once those methods are analyzed the implementation has already some hints derived when it comes to calling these methods.

Analysis considerations and drawbacks

CPU + Memory drawbacks

Memory drawback: the whole process have to save the whole package of data in memory:

There are taint configs and sink configs loaded during startup from TXT files
New method configs are derived during taint data flow
DataFlow analysis is cached for each method with start+end state for each code block
There are derived sinks and their propagation saved in each detector run

CPU drawback: each analyzed class bytecode is processed several times, once during TaintDataflow and then every time with every Detector. Since TaintDataflow keeps only partial flow information (code block start+end facts) it has to execute partial data flow every time the analysis asks for an instruction in a middle of a code block.

So the more files are analyzed the bigger memory consumption is and more CPU it takes. Hence it's discouraged to analyze everything at once unless you have a lot of CPU+Memory.

Inter-procedural or intra-procedural

Both taint analysis phases are trying to be inter-procedural - connecting multiple methods. To keep the state after each method intra-procedural analysis:

TaintDataflow analysis saves result of the method analysis. It derives how the method manipulates it's parameters (arguments) and what is the return taint and then stores it for later use in TaintConfig.
Sink Detectors are saving call site + parent method when unsafe taint hits their sink in local storage. This helps propagate the sink up in the CFG.

So the implementation is building the information bottom-up on both sides (1) TaintDataflow analysis is deriving how data is tainted through the CFG (2) and how sinks are called with data through CFG, until all methods are analyzed. Then the analysis can say when a sink is hit with a real user input or unused (despite having a vulnerability).

Context sensitivity

It's important to say that the implementation does NOT contain context-sensitive analysis because it analyzes each method only once, not for every call site.

On the other hand the analysis gathers real class types of method invocation behind interface calls as well as null values and constants. It allows to manually configure specific Java types to be secure or immutable. This provides a bit of context and type-sensitivity for removing false-positives.

General limitations

There are other general limitations impacting the analysis results:

There is context outside of reach: application classes that are analyzed are usually only a part of a bigger picture with frameworks and JDK behind. It's discouraged to analyze everything so the implementation must make "shortcuts" in forms of manual taint configuration files.
The implementation doesn't analyze all possible ways how data can reach sinks: it's missing full context-sensitive analysis, evaluating all if-conditions (reachability) or building every possible heap to connect each field and variable with data on heap (points-to analysis).
There are heap storages that are "hidden" like object fields, Java collections, ThreadLocals and similar. Without context-sensitive analysis it doesn't know under which key a hashmap stores tainted value or what is a value of a deep nested object field.
There are call graphs where the implementation is unable to connect the dots so it loses track of the taint before it can hit a sink. For example using Java reflection. Without proper string-solving analysis that would be done in FindBugs (SpotBugs) it's impossible to follow the track.
There are call graphs where it over-connects the dots. For example having CFG reaching a list of classes implementing an interface. Having one of the implementation class vulnerable the implementation has to assume the whole list to be vulnerable because it doesn't know which class is picked from the list to continue CFG.

For such cases the analysis implementation use UNKNOWN taint for data and report also sinks that are "not called". This ensure all true-positives are found, but also many false-positives :)

Code Overview

FindSecBugs exports 2 main features to FindBugs (SpotBugs) related to the taint data flow analysis:

TaintDataflow analysis for analyzing taint data flow inside methods.
Taint data flow Analysis Detectors inheriting from Detector -> AbstractTaintDetector -> AbstractInjectionDetector -> BasicInjectionDetector for analyzing sinks.

Beside it also contains other detectors that are based on the classic Detector interface or OpcodeStackDetector.

There are basically 3 types of Taint states

TAINTED - means the data contains user input. The list of methods producing user input is manually updated in taint-config directory.
UNKNOWN - means the data is coming from a source that is not known, like a method parameter, unknown method call
SAFE - means the data is known to be safe, coming from manually configured taint config, safe method return value, a constant or similar.

TaintDataflow - Phase 1

Taint Analysis Registration

The analysis is registered into FindBugs (SpotBugs) by:

findbugs.xml: <EngineRegistrar class="com.h3xstream.findsecbugs.taintanalysis.EngineRegistrar"/>
EngineRegistrar: new TaintDataflowEngine().registerWith(cache);
TaintDataflowEngine: iac.registerMethodAnalysisEngine(TaintDataflow.class, this);

There's nothing important there beside the registration.

Triggering Taint Analysis

Any code in FindBugs (SpotBugs) can request a method taint analysis by calling: Global.getAnalysisCache().getMethodAnalysis(TaintDataflow.class, methodDescriptor); which triggers com.h3xstream.findsecbugs.taintanalysis.TaintDataflowEngine#analyze (in case it's not already cached by global analysis cache).

The analysis is actually triggered from taint detectors. FindBugs (SpotBugs) calls com.h3xstream.findsecbugs.injection.AbstractTaintDetector#visitClassContext which lists class methods in a reverse call order and then performs analysis on each of the method in AbstractTaintDetector#analyzeMethod.

AbstractTaintDetector#analyzeMethod is responsible for triggering "phase 1" by obtaining TaintDataflow analysis by calling AbstractTaintDetector#getTaintDataFlow -> Global.getAnalysisCache().getMethodAnalysis(TaintDataflow.class, descriptor)

This is how the method taint data flow analysis is triggered in com.h3xstream.findsecbugs.taintanalysis.TaintDataflowEngine#analyze.

One important thing is that TaintDataflowEngine keeps an instance of TaintConfig that holds content of TXT files about outer world methods and fields and will be used to save derived analysis result.

Analysis Execution

TaintDataflowEngine#analyze

Initializes TaintAnalysis (with TaintFrameModelingVisitor) and TaintDataflow.
Executes the dataflow
Finishes the analysis.

Initialization

Initialization - nothing important, perhaps only that TaintFrameModelingVisitor creates TaintMethodConfig to be saved later into global TaintConfig as the analysis result.

Dataflow execution

... Where the magic happens ...

FindBugs' (SpotBugs') Dataflow execution is the most interesting part. I won't explain how it works because I don't understand it fully and I'm constantly surprised that it works. The important part is that it's using CFG (Control Flow Graph) with FrameDataflowAnalysis to navigate Frames with data through the code execution.

Important to say that each Frame models JVM's operand stack for instructions to act on. So every instruction block have it's Frame. The taint analysis pushes and works with Taints instead of real values like JVM would be doing - modelling Taint flow through the method instead of real values.

TaintAnalysis initializes new TaintFrame when enters the analyzed method (#initEntryFact) with taints for incoming method parameters
DataFlow iterates through CFG blocks and analyze each block instruction by calling TaintAnalysis#transferInstruction.
1. TaintAnalysis is implemented in a way that it triggers TaintFrameModelingVisitor for every instruction visited.
2. The visitor does it's magic with TaintFrame based on the instruction and then it leaves.
DataFlow transfers from one block to another through "an edge" via TaintAnalysis#meetInto and usually only merges the two stacks (from->to) with TaintAnalysis#mergeValues.
DataFlow transfers into return instruction where TaintFrameModelingVisitor#visitARETURN and #visitReturnInstruction computes method result.

Finalization

Analysis finishes - Here the analysis saves the result calling TaintFrameModelingVisitor#finishAnalysis that saves the method config into a global TaintConfig instance. When other method calls this analyzed method it will know return Taint based on incoming arguments/parameters.

Sometimes the analysis doesn't finish and throw DataflowAnalysisException "Too many iterations". This is because BasicAbstractDataflowAnalysis caches block's start+finish facts and DataFlow repeats data flow analysis through CFG unless it finds a fixed point - no change to every block's result fact. Some methods can be too complex with too many edges and DataFlow analysis is unable to find a stable fixed point within the default 97 cycles, in such cases it's important to increase dataflow.maxiters System property.

Next steps

The TaintDataflow is saved into global AnalysisCache to be used when needed. Next step in the flow is sink matching.

Sink matching - Phase 2

Matching sinks with taints continues right after deriving method taint data flow analysis.

Entering Sink Locations Analysis

To find sinks AbstractTaintDetector#analyzeMethod() now iterates over method bytecode to find method invocation and analyze it using AbstractTaintDetector#analyzeLocation()

Once it finds a method call it divs into #analyzeLocation(), implemented in AbstractInjectionDetector#analyzeLocation().

Analyzing Sink Location

The goal is to find whether current method invocation is matching: 1, Either manually configured (static) sinks for this detector 2, or sinks derived and propagated as the result of the first step.

Hence the sink analysis is executed in 2 steps:

Check derived sinks to see if there's a record of propagated sink
Check static sinks and receive arguments that can be injectable

For the sake of clarity I'm going to introduce a naming convention in this text.

It's important to recognize we have 2 methods here:

The currently analyzed method ("method in subject") - where analysis is searching for all the sinks.

The method invoked by current instruction - possible "sink method"

We also have parameters entering both methods. For clarity I'm going to name them as parameters and arguments.

The currently analyzed method ("method in subject") will have "parameters"

The invoked "sink method" will have arguments

Now we can continue.

Both steps work similarly. They analyze taints flowing into the "sink method" as method arguments. If the arguments are not influenced by external conditions (parameters) then the analysis can report bug directly. But when the sink arguments are influenced by "method in subject" parameters then the analysis registers "method in subject" to be also sink leading to the sink method eventually.

Checking Static Sinks

When checking static sink the analysis is checking Taint of sink arguments. Thanks to already finished TaintDataflow analysis there is a prepared TaintFrame stack full of Taints for every instruction.

The analysis grabs the Taint from stack (TaintFrame) corresponding to the sink argument and checks if it's safe to be ignored.

When the Taint is not safe and depends on "method in subject" parameters the analysis have to register the "method in subject" to be also a sink - it's derived sink. It saves 2 important information:

Registers the sink under "method in subject" to have all static sinks in one place associated under the "method in subject".
Registers taint for each of the sinks.

On the other hand, when Taint has no parameters then there are no external sources that the analysis tracks and analysis can report a bug directly.

Of course there can be other external sources of the Taint beside parameters that influence the method behaviour. For example object fields or other invoked method calls in the "method in subject". But the analysis is not using them for sink propagation and rather mark them as "taint sources".

Checking Derived Sinks

This check is similar to checking static sinks with the difference that:

It will get multiple (all) sinks in the invoked method - all call sites that are derived earlier
It merges Taint of the sink with incoming parameters to get final Taint
It doesn't create a new Sink but propagates existing Sink object

The analysis iterates over every static sink previously registered under the invoked method. The invoked method is not static sink but a "derived sink" and encapsulates multiple static sinks with respective "sink taint".

Then again grabs Taint from stack for every sink argument and merges to get a final Taint.

And again, when the final Taint is influenced by incoming parameters the analysis registers current "method in subject" as derived sink:

On the other side when the final Taint is NOT influenced by incoming "method in subject" parameters then underlying propagated sink is informed that here it's called with SAFE/UNKONWN/TAINTED taint from this call site

Reporting

When FindBugs (SpotBugs) finishes a pass with processing of all detectors on all classes in the current pass there's reporting phase. Each detector is called with '#finishPass()' method that transfers to AbstractInjectionDetector#report().

There the analysis gathers the remaining derived Sink instances and triggers InjectionSink#generateBugInstance().

Other interesting code parts

TaintFrameModelingVisitor magic

TaintFrameModelingVisitor is a magical class that manipulates stack based on instructions.

Manipulating TaintFrame Stack

To understand how TaintFrameModelingVisitor works it's important to understand JVM Instruction Set.

In short - TaintFrameModelingVisitor is implementing a visitor pattern (of course) and models JVM operand stack based on every important JVM instruction. The stack (represented by TaintFrame) is changed based on how many stack values are consumed and produced by each instruction. Instead of calling real instructions and operating with real values TaintFrameModelingVisitor is manipulating with Taint objects on the stack depending how the data is changed.

For example, when a constant is loaded into stack with LDC instruction the specification says:

The Operand Stack doesn't have anything to be consumed but the instruction produces a value to the stack.
The spec also says what is returned by the instruction (int, float, String constant, class ref, dynamically computed constant), which are safe sources for the visitor.

The visitor puts a State.SAFE Taint onto the stack. Later when a method is invoked with this constant the TaintFrame will have this safe Taint on the stack and the analysis knows the method call parameter is safe.

TaintFrameModelingVisitor.visitInvoke()

The most important method is TaintFrameModelingVisitor#visitInvoke(). This method is the one that configures Taints in TaintFrame for sink matching.

Method retrieves TaintMethodConfig for the invoked method. The config is actually a "receipt" from TXT files in taint-config directory that says:

How the taints of method arguments transfer into return value Taint
How the method influence arguments that are mutable

First step

The first step - #getMethodTaint() is simple and concerns return value. Based on the receipt the visitor takes invoked method return value and merges it with current stack. If the result is UKNOWN or even TAINTED then marks this position. The actual merging is a bit more complex but more or less only carries over all information from the parameters into the result, including non-parametric state. Non-parametric state is a Taint.State that is not influenced by parameters, for example can be SAFE from inside the method because there's nothing dangerous inside.

Second Step

#taintMutableArguments() is the opposite of the first step where arguments are read from stack. This step is concerned with updating those arguments because some of them can have mutable type (mainly objects and arrays).

It can happen the method is known for the analysis because it's one of those methods that were already processed and so when the methodConfig is dynamically derived by this analysis with extra information the code updates arguments on the stack and associated variables at their locations on the stack.

If the method was not processed by the analysis then it doesn't know if arguments were tainted inside or not. Therefore it merges the mutable arguments with defaultValue() = UNKONWN Taint.

Lastly #transferTaintToMutables() actually uses TXT configuration from taint-config to update the arguments. The logic behind is that it allows developers to manually correct and fix the reasoning.

The receipt is based on the return Taint being merged into arguments and is used in receipts like java/util/List.add(ILjava/lang/Object;)V:0,2#2. The receipt says the list itself is tainted when a tainted value is added into it (java.lang.Object has index 0, int has index 1, the list has index 2: merge(0,2)->2).

Hence the code modifies the taint on the stack for this method invoke instruction as well as any variables associated with this.

Method end

In the end this method pulls invoked method parameters from stack and pushes the return value. It actually removes arguments (Taints) from TaintFrame and pushes return value (Taint) back to TaintFrame.

Then it calls an extension point for additional visitors to be able to react.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly