Written by Jesse Sampson and presented by Charles Leaver, CEO, Ziften
In the first post about edit distance, we looked at hunting for malicious executables with edit distance (i.e., the number of character edits it takes to make two text strings match). Now let’s look at how we can use edit distance to hunt for malicious domains, and how we can build edit distance features that can be combined with other domain features to pinpoint suspicious activity.
Case Study Background
What are bad actors trying to do with malicious domains? It might be as simple as using a near-identical spelling of a common domain name to trick careless users into viewing ads or picking up adware. Legitimate websites are slowly catching on to this technique, sometimes called typo-squatting.
Other malicious domain names are the product of domain generation algorithms, which can be used for all kinds of nefarious purposes, such as evading countermeasures that block known compromised sites, or overwhelming domain servers in a distributed denial of service attack. Older variants use randomly generated strings, while more advanced ones add tricks like injecting common words, further confusing defenders.
Edit distance can help with both use cases, and here we will see how. First, we exclude common domains, since these are generally safe. A list of common domains also provides a baseline for detecting anomalies. One good source is Quantcast. For this discussion, we will stick to domains and avoid subdomains (e.g. ziften.com, not www.ziften.com).
After data cleaning, we compare each candidate domain name (input data observed in the wild by Ziften) to its potential neighbors in the same top-level domain (the last part of a domain name: classically .com, .org, and so on, but now it can be almost anything). The basic task is to find the nearest neighbor in terms of edit distance. By finding domains that are one step away from their nearest neighbor, we can easily spot typo-ed domains. By finding domain names far from their nearest neighbor (the normalized edit distance we introduced in Part 1 is useful here), we can also find anomalous domain names in the edit distance space.
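As a rough illustration of this cleaning step, here is a minimal Python sketch. It is not the code used for the case study. The function names are hypothetical, and keeping only the last two labels of a hostname is a simplifying assumption that breaks on multi-part suffixes such as .co.uk; a production version would consult a public suffix list.

```python
def registered_domain(hostname: str) -> str:
    """Reduce a hostname to its last two labels, e.g. www.ziften.com -> ziften.com.

    Assumption: the registrable domain is always the last two labels.
    This is wrong for suffixes like .co.uk and is only meant as a sketch.
    """
    labels = hostname.strip().lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def drop_common(observed, baseline):
    """Return observed domains that are not in the baseline of common domains.

    Subdomains are collapsed first, and duplicates are removed while
    preserving the order in which domains were first seen.
    """
    common = {registered_domain(d) for d in baseline}
    seen = set()
    candidates = []
    for host in observed:
        domain = registered_domain(host)
        if domain in common or domain in seen:
            continue
        seen.add(domain)
        candidates.append(domain)
    return candidates
```

The baseline list would come from a popularity ranking such as the Quantcast list mentioned above; what remains after filtering is the set of candidate domains to compare against that baseline.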
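The nearest-neighbor search can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation behind the results: the function names are hypothetical, the top-level domain is taken to be everything after the last dot, and the normalized distance is assumed to be the edit distance divided by the length of the longer name, which may differ from the definition in Part 1.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def split_domain(domain: str):
    """Split 'ziften.com' into ('ziften', 'com'). Assumes a single-label TLD."""
    name, _, tld = domain.rpartition(".")
    return name, tld


def nearest_neighbor(candidate: str, baseline):
    """Find the closest baseline domain in the same TLD.

    Returns (neighbor, distance, normalized_distance), or None when the
    baseline holds no other domain in the candidate's TLD.
    """
    name, tld = split_domain(candidate)
    best = None
    for known in baseline:
        known_name, known_tld = split_domain(known)
        if known_tld != tld or known == candidate:
            continue
        distance = edit_distance(name, known_name)
        if best is None or distance < best[1]:
            best = (known, distance)
    if best is None:
        return None
    neighbor, distance = best
    # Assumed normalization: divide by the longer of the two names.
    longest = max(len(name), len(split_domain(neighbor)[0]), 1)
    return neighbor, distance, distance / longest
```

With a baseline containing google.com, a made-up candidate such as gooogle.com comes back with a distance of 1, which flags it as a possible typo. A candidate whose normalized distance is close to 1 resembles nothing in the baseline. Note that this brute-force loop compares every candidate with every baseline domain in its TLD, so larger datasets would need pruning, for example by name length.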
What Were the Results?
Let’s look at how these results appear in practice. Be careful when browsing to these domains, since they could contain malicious content!
Here are a few possible typos. Typo-squatters target popular domains since there is a greater chance someone will visit. Several of these are suspect according to our threat feed partners, but there are some false positives as well, with cute names like “wikipedal”.
Here are some odd-looking domains far from their neighbors.
So now we have created two useful edit distance metrics for hunting. Not only that, we have three features to potentially add to a machine learning model: rank of the nearest neighbor, distance from the nearest neighbor, and a flag for being at edit distance 1 from the neighbor, which indicates a risk of typo tricks. Other features that might work well with these are other lexical features like word and n-gram distributions, entropy, and the length of the string, as well as network features like the total count of failed DNS requests.
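To make the feature list concrete, here is a hedged Python sketch of how such a feature row might be assembled for one domain. The function names, dictionary keys, and the choice of character bigrams are all illustrative assumptions; the neighbor rank, neighbor distance, and failed DNS count are taken as inputs computed elsewhere.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def char_ngrams(text: str, n: int = 2):
    """All overlapping character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def domain_features(name, neighbor_rank, neighbor_distance, failed_dns_count=0):
    """Build one feature row for a domain name (without its TLD).

    neighbor_rank: popularity rank of the nearest baseline neighbor.
    neighbor_distance: edit distance to that neighbor.
    failed_dns_count: failed DNS requests observed for the domain.
    """
    return {
        "neighbor_rank": neighbor_rank,
        "neighbor_distance": neighbor_distance,
        "is_one_edit_away": neighbor_distance == 1,  # typo-squatting signal
        "length": len(name),
        "entropy": shannon_entropy(name),
        "distinct_bigrams": len(set(char_ngrams(name, 2))),
        "failed_dns_count": failed_dns_count,
    }
```

Randomly generated names tend to score higher on entropy than dictionary-like names, which is one reason entropy pairs well with the edit distance features, although how much each feature contributes depends on the model and the data.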
Simplified Code that you can Play Around with
Here is a simplified version of the code to play with! It was developed on HP Vertica, but this SQL should work with most advanced databases. Note that the Vertica editDistance function may go by a different name in other implementations (e.g. levenshtein in Postgres or UTL_MATCH.EDIT_DISTANCE in Oracle).