Machine learning in cybersecurity: what is it and what do you need to know?

Machine learning in cybersecurity will drive enormous growth in spending on big data, intelligence and analytics, says ABI Research

Recent breakthroughs in machine learning and artificial intelligence mean AI-enabled technologies are gaining traction. The billion-dollar cybersecurity industry is no exception, as vendors begin to scale and automate their processes intelligently - all while locked into the early stages of a security arms race with professional hackers.

A recent report from analyst firm ABI Research estimates that machine learning in cybersecurity will drive spending on big data, intelligence and analytics to as much as $96 billion (£71.9 billion) by 2021.

Vendors are likely to find buyers among large enterprises, particularly in industries that are especially prone to attack: think government and defence, banking, and the technology sector. At the moment, ABI's report says, User and Entity Behavioural Analytics - using machine learning for threat detection by analysing data at scale - is the driving force.

"Using static machine learning models to detect previously unknown malware is the only use case I'm aware of that offers clear evidence of effective results," says cybersecurity analyst at 451 Research, Adrian Sanabria.

"Most machine learning use in the industry right now is experimentation and seeing what sticks," Sanabria says. "The fact that machine learning has had some success in one area of infosec practically guarantees that this industry will attempt to use machine learning anywhere and everywhere it can be shoehorned in."

But threat detection is not a trivial matter: Cisco's recent annual cybersecurity report noted that the vast majority of companies are working to improve their threat detection capabilities.

See also: Why machine learning could be the next frontier for data centre operations

There are plenty of public breaches where not only was the organisation unaware of the intrusion until it was far too late, but it also had no idea of the true extent of the breach. A case in point is the devastating Yahoo hack, where the company eventually discovered that 1 billion email accounts had been compromised.

The author of the ABI paper, Dimitrios Pavlakis, tells Computerworld UK that to understand why machine learning is useful for detection, it's important to understand the distinction between its two primary approaches.

Supervised applications of machine learning tend to mean that you have clean and structured data - for example, anything that could be read in Excel - where you train the model on what you know and what you then expect the software to do. In this case you tell the algorithm what to do and where to look.
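To make that concrete, here is a minimal sketch of the supervised case in Python with scikit-learn - the file features, labels and training samples are invented for illustration:

```python
# Supervised learning: we label the training data ourselves, so the model
# learns to recognise threats that resemble ones we have already seen.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-file features: [size_kb, entropy, num_imports]
X_train = np.array([
    [120, 4.1, 30],   # benign
    [450, 7.8, 2],    # packed malware
    [200, 4.5, 25],   # benign
    [510, 7.6, 3],    # packed malware
])
y_train = np.array([0, 1, 0, 1])  # 0 = benign, 1 = malicious

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# The model can now score new samples - but only against known patterns.
print(clf.predict([[480, 7.7, 4]]))  # -> [1], i.e. flagged as malicious
```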

But unsupervised applications can examine unstructured data from multiple data sources. "With unsupervised models in machine learning you can then teach a model using neural networks and deep learning," says Pavlakis. "You can teach a machine learning algorithm to detect the unknown. The algorithms are being trained, models are being stacked and trained all together.

"So you feed them data and tell them this could be normal, for example, and if something strange happens, the unseen threat can then be flagged."

Cybersecurity vendors in the machine learning space include Splunk, Gurucul, Vectra, Trend Micro, Symantec, Invincea and CrowdStrike, while giant enterprises like IBM are also doing work in the field.

Many eyes

"What machine learning offers is the 'many-eyes' option," explains Gunter Ollmann, CSO at Vectra Networks. "You can use machines to observe the network continuously in real time, and correlate that across hundreds of millions, to trillions, of events on a daily basis.

"A traditional approach from a security practitioner perspective is to take logs, drop them into some central database, and then, offline, mine that data for events that we have a feeling might be there," he says. "What machine learning offers is that all of the work can be done in real time, live in a network wire and without that human oversight."

Andrew Gardner, senior director for machine learning at Symantec, explains that where machine learning will really help is in scale and automation. Think of the difference, he says, between two humans playing chess and two computers playing chess - and the computers can play each other at very high speed.

"One thing that's useful for is it allows us to do predictive testing," he says. "We can, in a sandbox, use AI machine learning in the same way that an attacker might do, to predict and explore possible exploits on a scale that humans just can't achieve."

See also: Cybersecurity trends 2017: Malicious machine learning, state-sponsored attacks, ransomware and malware

Today, machine learning is best suited for solving problems external to customers, says 451's Adrian Sanabria.

"Web reputation and malware prevention, for example, involve analysing publicly accessible data, whereas any use of machine learning internally by a customer may require extensive customisation and training before it can be useful or effective," Sanabria says.

That's not to say internal applications will not be useful - but for now they are probably the preserve of large, enterprise-level organisations.

Chess

If threat detection is the most pertinent use of machine learning technologies today, what might this level of intelligence be able to achieve tomorrow? Developments on the horizon suggest machine learning could soon help with other security applications.

"When repeated actions can be automated in a trusted way, we'll see machine learning making incremental steps moving from detection, into remediation and then mitigation," says Vectra's Gunter Ollman.

"A key part about this is that they are learning systems," he says, adding that when Vectra's systems are deployed in a live environment they never stop learning.

"They observe how the end user, the customer security expert behind that pane of glass is responding to things that are there," Ollman says. Picture someone on the security team receiving a threat rating of 80 percent, and reacting by rating it much higher - the systems begin to learn the human processes of the organisation.

Symantec's Andrew Gardner adds that augmenting human talent will be a major part of the "natural evolution" of machine learning.

"We have an email product and one of the key features people are interested in now is finding targeted spearphishing attacks," he says. "These mostly come through emails - it's a tricky problem, machine-learning wise, because a really well-targeted spearphishing attack is crafted programmatically at a level that's really hard to detect. It's targeted to the right person at the right time, it probably doesn't spam everybody and it may even be unique.

"We have analysts that can review suspicious emails, but without machine learning, they're going through them one to ten at a time," he continues. "We've built tools where you meld the human and the computer together, and the machine learning system learns to hunt for the person. It's saying: 'I can't detect the threat, but I can detect what the user looks for - I can cut through this in vast swathes of the time'."

Gardner says without hyperbole that the developments we've started to see in machine learning are the first steps towards a self-evolution of technology - although he stresses that he doesn't mean that "quite like a Terminator movie".

"We have a system in our backend that teaches one of our antivirus detection engines how to improve itself," Gardner says. "The engines are set up in an automated fashion, so it learns: 'hey, you've made a mistake from this kind of data, you need to be focusing on this,' etc."

Rogue AI

The tendency for attackers and vendors to develop their capabilities in parallel complicates matters somewhat. If these incredible technologies can be used to defend, they can also be used to attack. It's uncontroversial to say that vendors and attackers are locked into an arms race with no real end in sight.

According to ABI's Pavlakis, the rush towards adopting machine learning is a reaction to the increasing sophistication and scale of attacks.

"Machine learning in cyber security is, of course, a natural evolution of the technology," he says. "But it is also a reactionary measure. Attackers by definition are always on the offensive, and they often have access to the same products that users and companies do - AV, security protocols, IP systems - they can study these inside and out.

"So attackers train algorithms to break into these products, they can buy an AV system, and they can test them - until their final product could be an algorithm that can break the antivirus, and until the final product is ready to work on a larger percentage of machines and systems out there.

"Companies now make use of more sophisticated machine learning models as a result of higher threat factors."

Splunk specialises in user behaviour analytics for threat detection. Matthias Maier, security evangelist for Splunk, predicts a more forbidding future for threats that make use of machine learning, especially as machine learning models become more accessible.

"What we expect will come up - though it's not there yet - is that attackers might use machine learning to automate very targeted attacks," Maier says. "Today they run a lot of complex attacks manually, and once they start using machine learning, they'll be able to merge attacks like social engineering, research, phishing, delivery, credentials theft, and ransomware payouts. Once they start using machine learning to connect those different attack types they can automate manual processes."

Vectra's Gunter Ollmann warns that professional attackers are studying machine learning very closely - and many of them are already data scientists.

"This is no different from 10 years ago when behavioural learning systems came out that the bad guys invested their own time, and they found ways to detect and bypass the sandboxing technologies," he says. "I expect we'll see that same level of thought and actions going into machine learning and artificial intelligence."

Ollmann says attackers are already automating parts of their large-scale offensives. Worse still, there's every possibility that malware equipped with artificial intelligence could be set loose online: a rogue, intelligent design that automatically and silently infiltrates systems for data.

Traditionally, Ollmann explains, the command and control server - used to remotely send malicious commands to botnets or other compromised systems - has been the "Achilles' heel" of attacks.

"The area we're most worried about is that many of today's malware threat detection systems are focused on the command and control aspects of malware, VPNs, and other devices that have been installed inside corporate networks," he says. "The scary part is that as AI and machine learning advances, it is inevitable that these learning modes will be planted inside the malware. Once the malware has got inside the network, it will do away with the command and control necessity, and it will be automatically intelligent, and programmed to hunt out and seek data.

"The only time network traffic will be observed is when it's completed the infiltration of that data."

Snake oil

So it's not all hype. Machine learning is concretely being used to protect enterprise-grade infrastructure today. But as with any new technology, there is always a danger that some vendors will latch on to the buzzword. What can organisations do to avoid buying into snake oil?

"I fundamentally believe that one of the best ways, and the best vehicles for understanding the scope and the capability of a technology from a vendor, is to look closely at their security research team and data science team," says Vectra's Gunter Ollmann. "First of all, if they don't have either of those, then there is no way they can be doing machine learning or artificial intelligence.

"The second one," he says, "is that effectively all the successful technologies that are out there in this space are products that are based on the capabilities of their research team - and the product is that research team wrapped into code, or into tin.

"So there are no security products out there that are more than what those research teams are capable of putting into that. Particularly when looking at the startup world but even some of the larger vendors in the security space, a closer look at their security research teams and science teams, what their pedigree and what the sizes are, has proven to be a clear indicator of their capabilities, the product performance, and their ability to detect and mitigate a threat."

It's important, then, to have layers upon layers of protection, prevention and detection, says 451's Adrian Sanabria - and not to be suckered in by slick marketing that paints machine learning as a panacea.

"We know from experience that attacks will simulate what infosec vendors are doing," Sanabria explains. "I wouldn't be surprised if they've already duplicated the industry's machine learning work, and are working to determine ways to get around it, if they haven't already.

"Machine learning models depend on a degree of likeness, so if attackers find a way to produce malware that looks significant different from what models expect, machine learning-based detection methods could become ineffective overnight. This is one of the reasons it is important to have many layers of prevention, detection and hardening."
