Machine Learning for Cybercriminals
Machine learning (ML) is taking cybersecurity by storm, as it is other tech fields. Over the past year, a great deal has been written about the use of machine learning in both defense and attack. While defense has been covered in most articles (I recommend reading “The Truth about Machine Learning in Cybersecurity”), machine learning for cybercriminals has received far less attention, and opinions on it are divided.
Nonetheless, the U.S. intelligence community is concerned about the use of artificial intelligence. Recent findings show how cybercriminals can deploy machine learning to make attacks better, faster, and much cheaper to perform.
The objective of this article is to systematize information on possible and real-life uses of machine learning for malicious purposes. It is intended to help members of information security teams prepare for imminent threats.
All cybercriminal tasks that machine learning can aid, from initial information gathering to system compromise, can be grouped into several categories:
- Information gathering – preparing for an attack;
- Impersonation – attempting to imitate a confidant;
- Unauthorized access – bypassing restrictions to gain access to some resources or user accounts;
- Attack – performing an actual attack such as malware or DDoS;
- Automation – automating the exploitation and post-exploitation.
Machine learning for information gathering
Information gathering is the first step of every cyberattack, whether it is a targeted attack or one aimed at multiple victims. The better the information you collect, the better your prospects of success.
As for phishing or infection preparation, hackers may use classification algorithms to assign a potential victim to an appropriate group. Imagine that, having collected thousands of emails, you send malware only to those recipients who are most likely to click on the link without finding it suspicious, thereby reducing the chance that a security team gets involved. A number of factors may help here. As a simple example, you may separate users who write about IT topics on their social networking sites from those focused on food and cats. As an attacker, I would choose the latter group. Various clustering and classification methods, from K-means and random forests to neural networks, can be used.
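To make the grouping step concrete, here is a minimal sketch of K-means in plain Python. Everything in it is an assumption made for illustration: the feature vectors (the share of a user's posts about IT versus food and pets), the toy data, and the function names nearest and kmeans. Initialization simply takes the first k points, which real implementations would replace with a random or k-means++ start.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, k, iterations=10):
    """Plain k-means; returns (centroids, labels)."""
    # Simplifying assumption: the first k points serve as initial centroids.
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical features per user: (share of posts about IT, share about food and pets).
users = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.05), (0.1, 0.9), (0.2, 0.7), (0.05, 0.95)]
centroids, labels = kmeans(users, k=2)
print(labels)
```

On this well-separated toy data the two topic groups end up in different clusters. Real profiles would need far richer features, and a supervised classifier such as a random forest would need labeled examples.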
In information gathering for targeted attacks, there is just one victim with a complex infrastructure, and the mission is to learn as much about this infrastructure as possible. The idea is to automate all the obvious checks, including information gathering about the network. While existing tools such as network scanners and sniffers can analyze traditional networks, the new generation of networks based on software-defined networking (SDN) is too complicated for them. That’s where machine learning can assist adversaries. A relevant example is the Know Your Enemy (KYE) attack, a little-known but conceptually interesting technique that allows stealthy intelligence gathering about the configuration of a target SDN network. The information that a hacker can collect ranges from the configuration of security tools and network virtualization parameters to general network policies such as QoS. By analyzing the conditions under which a rule from one network device is pushed into the network, and the type of that rule, an attacker can infer sensitive information about the configuration of the network.
The probing phase consists of a number of attempts by the attacker to trigger the installation of flow rules on a particular switch. The specific characteristics of the probing traffic depend on the information that interests the hacker.
In the next phase, the attacker analyzes the correlation between the probing traffic generated during the probing phase and the corresponding flow rules that were installed. From this analysis, he or she can infer which network policy is enforced for specific types of network flows. For instance, by using a network scanning tool in the probing phase, the attacker can figure out that a defense policy is implemented by filtering network traffic. Done manually, this can take weeks of data collection, and you would still need algorithms with preconfigured parameters, e.g. how many packets of a certain kind are needed to reach a decision, as that number depends on various factors. With the help of machine learning, hackers can automate this process.
Those are two examples, but in general any information gathering task that requires a great deal of time can be automated. For example, DirBuster, a tool for scanning for available directories and files, could be improved by adding genetic algorithms, LSTMs, or GANs to generate directory names that more closely resemble existing ones.
Machine learning for impersonation
Cybercriminals use impersonation to attack victims in various ways, depending on the communication channel and the goal. Attackers can convince victims to follow a link to an exploit or malware by sending an email or by using social engineering. Even a phone call can therefore serve as a means of impersonation.
Email spam is one of the oldest areas of security in which machine learning has been used, and I expect it will be one of the first areas where cybercriminals apply ML. Instead of writing spam text manually, they can “teach” a neural network to create spam that looks like a real email.
However, when dealing with email spam, it is hard to behave like the person you are imitating. If you send employees an email on behalf of the company’s administrator asking them to change their passwords or download an update, you will not manage to write it in exactly the same way as the administrator. You won’t be able to copy the style unless you have seen a pile of his or her emails. This issue, however, can be solved by social network phishing.
The biggest advantage of social media phishing over email phishing is publicity and easy access to personal information. You can watch and learn a user’s behavior by reading his or her posts. This idea was demonstrated in the research “Weaponizing Data Science for Social Engineering: Automated E2E Spear Phishing on Twitter”. The research presented SNAP_R, an automated tool that can significantly increase the effectiveness of phishing campaigns. While traditional automated phishing achieves a 5-14% success rate and manually targeted spear phishing 45%, their method sits right in the middle at 30%, and up to 66% in some cases, with the same effort as the automated approach. The authors used a Markov model to generate tweets based on a user’s previous tweets and compared the results with a recurrent neural network, specifically an LSTM. The LSTM provides higher accuracy but requires more time to train.
In the new era of AI, companies can create not only fake text but also fake voice and video. Lyrebird, a startup specializing in media and video that can mimic voices, demonstrated that it can make a bot that speaks exactly like you. With the growing amount of data and evolving networks, hackers can achieve even better results. Since we don’t know how Lyrebird works, and hackers probably can’t use this service for their own needs, they may turn to more open alternatives such as Google’s WaveNet, which can do similar things. These systems rely on deep generative neural networks, a family that also includes generative adversarial networks (GANs).
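The Markov-model approach mentioned above for producing new posts from a user's earlier posts can be sketched as a word-level Markov chain. This is a generic illustration under stated assumptions, not the SNAP_R implementation: the sample posts and the names build_chain and generate are hypothetical.

```python
import random
from collections import defaultdict


def build_chain(texts, order=1):
    """Map each tuple of `order` consecutive words to the words seen right after it."""
    chain = defaultdict(list)
    for text in texts:
        words = text.split()
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            chain[state].append(words[i + order])
    return chain


def generate(chain, start, max_words=12, seed=None):
    """Walk the chain from `start`, sampling each next word from its observed followers."""
    rng = random.Random(seed)
    state = tuple(start)
    output = list(state)
    for _ in range(max_words - len(output)):
        followers = chain.get(state)
        if not followers:
            break  # dead end: this state was never followed by anything
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


# Hypothetical, harmless sample posts standing in for a user's timeline.
posts = [
    "my cat loves the new sofa",
    "my coffee is cold again",
    "the new coffee place is great",
    "my cat is asleep on the keyboard again",
]
chain = build_chain(posts)
sample = generate(chain, ["my"], seed=1)
print(sample)
```

Every adjacent word pair in the output was seen in the input, which is why such text mimics a writer's vocabulary but loses coherence over longer spans. That limitation is one reason an LSTM can do better at the cost of longer training.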
Machine learning for unauthorized access
The next step is obtaining unauthorized access to user accounts. Imagine cybercriminals need to get unauthorized access to a user’s session. The obvious way is to compromise the account. For mass hacking, one of the annoying obstacles is CAPTCHA bypass. A number of computer programs can solve simple CAPTCHA tests, but the most complex part is object segmentation. Numerous research papers describe CAPTCHA bypass methods. One of the first examples using machine learning was published on June 27, 2012 by Claudia Cruz, Fernando Uceda, and Leobardo Reyes. They used support vector machines (SVM) to break a system running on reCAPTCHA images with an accuracy of 82%. CAPTCHA mechanisms were significantly improved in response. Afterward, however, a wave of papers appeared that leveraged deep learning methods to break CAPTCHAs. In 2016, an article was published that detailed how to break simple-captcha with 92% accuracy using deep learning.
Another study used one of the latest advances in image recognition, deep residual networks with 34 layers, to break the CAPTCHA of IRCTC, a popular Indian website, with 95-98% accuracy. These articles mostly dealt with character-based CAPTCHAs.
One of the most inspiring papers was presented at the Black Hat conference. The research was called “I am a Robot”. The authors broke the latest semantic image CAPTCHA and compared various machine learning algorithms. The paper promised 98% accuracy in breaking Google’s reCAPTCHA.
To make things even worse, a new article states that scientists warn of forthcoming 100% CAPTCHA bypass methods.
Another area where cybercriminals may find advantages with the help of machine learning is password brute force.
Markov models were first used to generate password “guesses” in 2005, long before deep learning became so topical. If you are familiar with recurrent neural networks and LSTMs, you have probably heard of a network that generates text based on the text it was trained on: give it the works of Shakespeare, and it will create new text in the same style. The same idea can be used for generating passwords. If we train a network on the most common passwords, it can generate a lot of similar ones. Researchers took this approach, applied it to passwords, and obtained results better than the traditional mutations used to create password lists, such as changing letters to symbols, e.g. from “s” to “$”.
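The same kind of character-level Markov model can also be turned around and used to estimate how predictable a candidate password is, which is how a defender might apply it. The sketch below shows that scoring side only. The tiny training list, the names train and log_likelihood, and the smoothing constant alphabet_size=96 (roughly printable ASCII plus an end marker) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Multi-character sentinels cannot collide with any single password character.
START, END = "<s>", "</s>"


def train(passwords):
    """Count character-to-character transitions, including start and end markers."""
    counts = defaultdict(Counter)
    for pw in passwords:
        chars = [START] + list(pw) + [END]
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    return counts


def log_likelihood(counts, password, alphabet_size=96):
    """Log-probability of a password under the bigram model with add-one smoothing."""
    chars = [START] + list(password) + [END]
    total = 0.0
    for a, b in zip(chars, chars[1:]):
        followers = counts.get(a, Counter())
        seen = followers[b]
        context = sum(followers.values())
        total += math.log((seen + 1) / (context + alphabet_size))
    return total


# Hypothetical training list of well-known weak passwords.
common = ["password", "password1", "passw0rd", "letmein", "qwerty", "qwerty123"]
model = train(common)

predictable = log_likelihood(model, "password12")
unusual = log_likelihood(model, "xK9#vLq2zT")
print(predictable > unusual)
```

A higher (less negative) score means the password looks more like the training data and would therefore be reached earlier by a guesser using a similar model.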
Another approach was described in the paper “PassGAN: A Deep Learning Approach for Password Guessing”, where researchers used generative adversarial networks (GANs) to generate passwords. A GAN is a special type of neural network consisting of two networks: one is usually called the generator and the other the discriminator. While the generator produces candidate examples, the discriminator tests whether it can tell them apart from real data. The core idea is to train the networks on real password data collected from recent data breaches. And after the publication of the biggest database, 1.4 billion passwords from all breaches, the idea looks promising for cybercriminals.
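For reference, the game between the two networks is usually written as the minimax objective of the original GAN formulation, where G is the generator, D is the discriminator, x is a real sample, and z is random noise fed to the generator. Specific systems such as PassGAN may use variants of this loss, so treat it as the textbook form rather than the exact objective of that paper.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator tries to maximize this value by scoring real samples high and generated ones low, while the generator tries to minimize it by producing samples the discriminator accepts as real.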
Machine learning for attacks
The fourth area where cybercriminals want to use machine learning is the actual attack. Overall, there are three general goals of attacks: espionage, sabotage, and fraud. Nearly all of them are carried out with malware, spyware, ransomware, or other types of malicious programs, which users download as a result of phishing or which attackers upload to a victim’s machine by exploiting vulnerabilities. In any case, attackers need to get malware onto the victim’s machine somehow.
The use of machine learning for malware protection was probably the first commercially successful application of machine learning in cybersecurity. There are dozens of works describing different techniques for detecting malware using artificial intelligence (AI), but that is a topic for another article.
How can cybercriminals use machine learning for creating malware? The first well-known example of AI for malware creation was presented in 2017 in the paper “Generating Adversarial Malware Examples for Black-Box Attacks Based on GAN”, whose authors built a network called MalGAN.
This research proposes an algorithm to generate malware examples that are able to bypass black-box, machine learning based detection models. The presented algorithm turns out to be much better than traditional gradient-based adversarial example generation algorithms and is able to decrease the detection rate to nearly zero. The algorithm is quite straightforward: the system takes original malware samples as input and outputs adversarial examples based on a sample and some noise. The non-linear structure of neural networks enables them to generate more complex and flexible examples to trick the target model.
I mentioned earlier that there are three main attack purposes, espionage, sabotage, and fraud, and that most of them are carried out with malware. Nevertheless, there is another relatively new type of attack that can be considered sabotage: it is dubbed crowdturfing. Put simply, crowdturfing is the malicious use of crowdsourcing services. For example, an attacker pays workers some cash to write negative online reviews for a competing business. Because real people write them, these reviews often go undetected, as automated tools are looking for software attackers.
Other options include mass following, DoS attacks, or the generation of fake information such as fake news. With the help of machine learning, cybercriminals can reduce the cost of these attacks and automate them. The research “Automated Crowdturfing Attacks and Defenses in Online Review Systems”, published in September 2017, introduced an example of a system that generates fake reviews on Yelp. The advantage was not just convincing reviews that can’t be detected, but reviews with better scores than those written by humans.
Machine learning for cybercrime automation
Experienced hackers can use machine learning in various areas to automate the necessary tasks. It’s difficult to tell when and what exactly will be automated, but bear in mind that cybercrime organizations with hundreds of members need different types of software, e.g. support portals or support bots.
As for specific cybercrime tasks, there is a new term, hivenet, which stands for smart botnets. The idea is that whereas botnets are managed manually by cybercriminals, hivenets can have a sort of brain that lets them react to particular events and change their behavior accordingly. Multiple bots will sit in devices and, depending on the task, decide which of them will use a victim’s resources at a given moment. It’s like a chain of parasites living in an organism. The bots will have a collective brain. Here I would like to stop reflecting on the next steps and particular examples and save that for the next article, which will cover machine learning for defense.