There can sometimes be a fine line between suspicion and guilt. Distinguishing malicious from “good” activity can be a challenging task in today’s cyber world, full of hidden and dark secrets. A system based on accuracy and thorough analysis of all the evidence stands the best chance of identifying the true malicious actor.

Consider, for example, a popular way to illegally extract money from someone’s bank account. The malicious actor creates a link that exploits a Cross-Site Request Forgery (CSRF) vulnerability on a banking site, and makes sure that the victim clicks on it while logged in to his online banking account on that site. The victim thinks he is transferring $2,000 to pay the rent, but the malicious link changes the request so that $20,000 is transferred from his account to an anonymous bitcoin account. The money is laundered away before it can be traced.

This is where the accuracy of a Web Application Firewall (WAF) is critical. An accurate WAF blocks the forged request triggered by the malicious link before it reaches the vulnerable bank site.

Let us assume the bank has a WAF to prevent such attacks, but is reluctant to put strict WAF controls in place because it fears false positives. In the past, users complained that they were wrongly blocked from accessing the site after the WAF was tweaked. The answer for the bank is an accurate WAF that lets normal users through and blocks only malicious attacks.

So you ask yourself, how is WAF accuracy determined? After all, there are true positives, false positives, true negatives and false negatives (pause for a moment to think which is which!). If false positives are minimised (do not suspect the innocent guy), then we might not find all true positives (nailing the guilty one). If false negatives are minimised (never let the bad guy get away), we might as a side effect end up suspecting (and blocking!) many innocent people.
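To make "which is which" concrete, here is a minimal sketch that tallies the four outcomes from a list of WAF decisions. It assumes the usual convention that a "positive" is a request the WAF blocks as malicious; the function name, the `(is_attack, was_blocked)` pair format and the sample data are hypothetical and only there for illustration.

```python
from collections import Counter


def count_outcomes(decisions):
    """Tally TP, FP, TN, FN from (is_attack, was_blocked) pairs.

    Assumption: "positive" means the WAF blocked the request as malicious.
    """
    counts = Counter({"TP": 0, "FP": 0, "TN": 0, "FN": 0})
    for is_attack, was_blocked in decisions:
        if is_attack and was_blocked:
            counts["TP"] += 1  # guilty and caught
        elif is_attack and not was_blocked:
            counts["FN"] += 1  # guilty but got away
        elif not is_attack and was_blocked:
            counts["FP"] += 1  # innocent but blocked
        else:
            counts["TN"] += 1  # innocent and let through
    return counts


# Hypothetical decision log: (is_attack, was_blocked)
sample = [(True, True), (True, False), (False, False), (False, True), (False, False)]
outcomes = count_outcomes(sample)
print(dict(outcomes))
```

In practice the hard part is knowing `is_attack` for every request, which is why a labelled test data set is needed before any accuracy figure can be quoted.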

For such a system, there is a well-established measure of correctness that takes all four outcomes into account. It is called the Matthews Correlation Coefficient (MCC) (https://en.wikipedia.org/wiki/Matthews_correlation_coefficient):

MCC = ((TP*TN)-(FP*FN))/SQRT((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))

What an equation, you may gasp! Well, I initially did. However, there is a way to make it easier to understand. First note that TP are true positives, TN true negatives, FP false positives and FN false negatives. SQRT is the square root. Given that we are talking about a correlation coefficient, the size of the data set (i.e. how many requests sit behind the counts of false positives, false negatives, etc.) is critical to its relevance as well. Note also that a correlation coefficient is a real number between -1 and 1. We are seeking a correlation coefficient that is as close to 1 as possible in order to have an accurate system.
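For readers who prefer code to algebra, the coefficient can be sketched in a few lines of Python. This is an illustration rather than any vendor's implementation; in particular, returning 0.0 when the denominator is zero is an assumed convention for the otherwise undefined case, and the sample counts are made up.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four outcome counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: when any factor is zero the coefficient is undefined,
    # so fall back to 0.0 instead of dividing by zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts: 90 attacks blocked, 900 good requests allowed,
# 10 good requests wrongly blocked, 5 attacks missed.
print(round(mcc(tp=90, tn=900, fp=10, fn=5), 3))
```

Note that the whole numerator, TP*TN minus FP*FN, is divided by the square root; the subtraction happens before the division.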

Let us analyse three cases based on some simple assumptions in order to better understand what this equation is actually representing.

Assumptions:

| Case # | Mathematical assumptions | Layman translation | Corresponds to real world scenario |
|---|---|---|---|
| 1 | FP, FN >> TP, TN | Number of false positives and false negatives is much, much larger than number of true positives or number of true negatives | Worst situation, as we are letting bad traffic through and blocking good traffic |
| 2 | TP ≈ FN ≈ TN ≈ FP | True positives, false negatives, true negatives and false positives are all roughly the same value | Not a very accurate WAF, as it seems arbitrary and is not making enough good decisions |
| 3 | TP, TN >> FP, FN | True positives and true negatives are much larger than false positives and false negatives | Ideal accuracy, as bad traffic is being blocked and there is little if any good traffic being blocked |

So now let us plug these assumptions into the equation (MCC = ((TP*TN)-(FP*FN))/SQRT((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))) and see what results.

**Case 1:** MCC ≈ -(FP*FN)/SQRT(FP*FN*FP*FN)

Note that in order to get to the above approximation we have only included FN and FP since TP and TN are so much smaller they can be neglected!

Next step: Case 1 results in MCC ≈ -(FP*FN)/SQRT(FP^2*FN^2) ≈ -1

So in the worst scenario, where FP and FN dominate the equation, we have a highly negative correlation coefficient, which makes sense.

**Case 2:** MCC ≈ ((TP*TN)-(FP*FN))/SQRT((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))

But if TP, TN, FP and FN are all about the same, then TP*TN and FP*FN nearly cancel and the numerator (the number on top) is approximately 0! The denominator is clearly not 0, so we do not have to evaluate it: the quotient is approximately 0.

So in the not ideal scenario, where all four outcomes are roughly equal, we have a correlation coefficient of 0, meaning the WAF's decisions are no better than random guessing: it is not actively wrong, but it does not provide any real value either.

**Case 3:** MCC ≈ (TP*TN)/SQRT(TP^2*TN^2)

Note that in order to get to this approximation we have only included TN and TP since FP and FN are so much smaller they can be neglected!

Next step: Case 3 results in MCC ≈ (TP*TN)/SQRT(TP^2*TN^2) ≈ 1

So in the ideal scenario, where true positives (identifying and blocking the bad traffic) and true negatives (letting the good traffic through) dominate the equation, we have a highly positive correlation coefficient, meaning a highly accurate WAF!
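The three approximations can also be checked numerically. The sketch below plugs made-up counts into the full formula, with "much larger" arbitrarily taken to mean a factor of 1,000; the counts are hypothetical and chosen only to match the three sets of assumptions.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four outcome counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: treat the undefined zero-denominator case as 0.0.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts matching the three cases in the text.
case_1 = mcc(tp=1, tn=1, fp=1000, fn=1000)      # FP, FN >> TP, TN
case_2 = mcc(tp=500, tn=500, fp=500, fn=500)    # all roughly equal
case_3 = mcc(tp=1000, tn=1000, fp=1, fn=1)      # TP, TN >> FP, FN

for label, value in (("Case 1", case_1), ("Case 2", case_2), ("Case 3", case_3)):
    print(f"{label}: MCC = {value:.3f}")
```

With these counts Case 1 comes out just above -1, Case 2 is exactly 0 and Case 3 is just below 1, in line with the approximations.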

So with these cases we see that the MCC works well as a measure of the accuracy of a Web Application Firewall. It is much more advantageous to have a mathematically appropriate method to calculate accuracy. A system that has been tweaked to have a maximum MCC value keeps the false positive rate low AND the true positive rate high. This is much better than having the bank not use the WAF due to a high false positive rate, or having a low true positive rate that allows malicious users to get through.
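What might "tweaking for a maximum MCC" look like? One possible sketch, assuming the WAF assigns each request an anomaly score and blocks anything at or above a configurable threshold, is to try several thresholds against labelled traffic and keep the one with the highest MCC. The scores, labels, candidate thresholds and function names here are all hypothetical; real WAFs expose their tuning in different ways.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four outcome counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: treat the undefined zero-denominator case as 0.0.
    return numerator / denominator if denominator else 0.0


def best_threshold(scored_requests, thresholds):
    """Return (threshold, mcc) for the candidate with the highest MCC.

    scored_requests: (anomaly_score, is_attack) pairs from labelled traffic.
    A request is blocked when its score is at or above the threshold.
    """
    best = None
    for t in thresholds:
        tp = sum(1 for s, attack in scored_requests if attack and s >= t)
        fn = sum(1 for s, attack in scored_requests if attack and s < t)
        fp = sum(1 for s, attack in scored_requests if not attack and s >= t)
        tn = sum(1 for s, attack in scored_requests if not attack and s < t)
        value = mcc(tp, tn, fp, fn)
        if best is None or value > best[1]:
            best = (t, value)
    return best


# Hypothetical labelled traffic: (anomaly_score, is_attack)
requests = [
    (0.05, False), (0.1, False), (0.15, False), (0.2, False), (0.3, False),
    (0.55, False), (0.4, True), (0.6, True), (0.8, True), (0.9, True),
]
threshold, value = best_threshold(requests, [0.2, 0.4, 0.5, 0.7])
print(threshold, round(value, 3))
```

On this toy data the loosest threshold blocks too many good requests and the strictest one misses attacks; the MCC picks the compromise in between.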

So that is enough math for a single blog entry. My plea is that you ask your WAF vendor:

- How do they calculate accuracy, and therefore tweak their WAF?
- What are their rates for all four outcomes: true and false positives, true and false negatives?
- How much data has the vendor used to determine the accuracy? (Remember: the larger the data set, the more reliable the correlation coefficient.)
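To see why the size of the data set matters, consider how much a single missed attack moves the coefficient. The sketch below compares a tiny test set with a large one; the counts are hypothetical, and this is only an illustration of sensitivity, not a formal statistical confidence measure.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four outcome counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: treat the undefined zero-denominator case as 0.0.
    return numerator / denominator if denominator else 0.0


# Hypothetical test sets: a perfect score, then the same set with
# one attack slipping through (one TP becomes one FN).
small_perfect = mcc(tp=5, tn=5, fp=0, fn=0)
small_one_miss = mcc(tp=4, tn=5, fp=0, fn=1)
large_perfect = mcc(tp=5000, tn=5000, fp=0, fn=0)
large_one_miss = mcc(tp=4999, tn=5000, fp=0, fn=1)

small_drop = small_perfect - small_one_miss
large_drop = large_perfect - large_one_miss
print(f"10 requests:     MCC drops by {small_drop:.4f}")
print(f"10,000 requests: MCC drops by {large_drop:.4f}")
```

A single error changes the score noticeably on ten requests and hardly at all on ten thousand, so an MCC quoted without the size of the test set says little.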

This article is brought to you by Enex TestLab, content directors for CSO Australia.