New research from AT&T seeks to create logical procedures to identify and classify the words in malicious domain names, and has found some unusual results. The paper Breaking Bad: Detecting malicious domains using word segmentation [PDF], by Wei Wang and Kenneth Shirley, investigates to what extent splitting primary domain names into individual words (rather than analysing full URLs that point to pages or content beyond the domain root) can accurately predict whether a domain is connected to illicit or illegal activity. Such activity includes acting as a Command-and-Control (C&C) server, or dispensing malicious binaries to users who have been tricked into visiting, often by phishing techniques or redirects.

The researchers used 15,000 ‘benign’ URLs from AOL’s Open Directory Project (DMOZ) as a control set against malicious domains drawn from varied, unspecified sources. They noted a verifiable improvement over current URL analysis, which seeks the same result but does not attempt to extract ‘whole’ words from root domain names.
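To make the idea of extracting whole words from a root domain name concrete, here is a minimal sketch of dictionary-based segmentation using dynamic programming. It is not the paper's method, which this article does not describe in detail. The vocabulary `VOCAB`, the penalty `UNKNOWN_COST` and the function name `segment` are all assumptions made for illustration; a real system would use a large word list with corpus frequencies.

```python
# Hypothetical toy vocabulary, invented for this example only.
VOCAB = {"cash", "payday", "loan", "loans", "pills", "golf", "texas", "pay", "day"}
MAX_LEN = max(len(word) for word in VOCAB)
UNKNOWN_COST = 10  # assumed penalty for a character no vocabulary word covers


def segment(name: str) -> list[str]:
    """Split a domain label into tokens, preferring few, known words.

    Each vocabulary word costs 1; any single character that is not itself
    a vocabulary word costs UNKNOWN_COST. The cheapest split wins.
    """
    n = len(name)
    # best[i] holds (cost, tokens) for the prefix name[:i]
    best = [(0, [])] + [None] * n
    for i in range(1, n + 1):
        candidates = []
        for j in range(max(0, i - MAX_LEN), i):
            piece = name[j:i]
            if piece in VOCAB:
                cost = 1
            elif len(piece) == 1:
                cost = UNKNOWN_COST
            else:
                continue  # multi-character pieces must be known words
            candidates.append((best[j][0] + cost, best[j][1] + [piece]))
        best[i] = min(candidates, key=lambda c: c[0])
    return best[n][1]


print(segment("paydayloans"))
print(segment("texasgolf"))
print(segment("cash4u"))
```

With this toy vocabulary, "paydayloans" splits into "payday" and "loans" rather than "pay", "day" and "loans", because fewer known words give a lower cost. Digits and stray letters fall out as single-character tokens, which a classifier could treat as features in their own right.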

The lexicon of malice proved to be largely predictable, including such unloved favourites as medic, pills, loan, fee, cash, payday, pharmacy, webcams, cams, lover, sex, porno, as well as varied references to luxury goods such as Ray-Ban sunglasses and higher-end brands such as Timberland, Ugg and Tiffany.

Mystifyingly, the names of notable basketball players such as LeBron James, Kobe Bryant and Michael Jordan also appear to be associated with malicious domains. The report notes:

‘It was especially interesting to be able to discover that certain basketball players’ names are associated with malicious domains – theoretically, these names would change over time as new basketball players became popular, and a model like M7 [the group’s proposed method], trained on a sliding window of fresh data, would detect the new names.’

Wang and Shirley were also able to identify words most commonly associated with benign domains, including Texas, Europe, Vermont, Washington, Colorado – and, by way of a contrasting sport, golf.

Though the inclusion of numbers (digits) in a domain was in general a sign that the site was less likely to be benign, certain sequences of numbers (again, not without a sporting reference) are unusually ‘safe’ at the moment, including 411, 365 and 123.

The report indicates that the further development of word segmentation could ‘apply near-real time detection’ of malicious domains when browsing on a mobile device, and concludes:

‘If a domain is estimated to have a high probability of being malicious based solely on its name, then a more expensive analysis (such as web content-based analysis) could be used to determine further action, such as blocking the site or inserting a “speed bump”. In this way, the word segmentation techniques described here could improve existing systems that use machine learning to detect malicious domains by generating thousands of additional features with which to classify domains.’
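The two-stage approach in that conclusion, a cheap name-based estimate that decides whether to trigger a costlier check, can be sketched as follows. This is an illustration under stated assumptions, not the authors' model: the per-word weights, the bias, the threshold and the function names are all hypothetical, and a trained classifier would learn its weights from labelled data.

```python
import math

# Hypothetical log-odds weights per word, invented for illustration.
WEIGHTS = {"payday": 2.0, "loans": 1.5, "pills": 2.2, "golf": -1.5, "texas": -1.2}
BIAS = -1.0       # assumed prior leaning towards 'benign'
THRESHOLD = 0.8   # assumed cut-off for escalating to the expensive analysis


def malicious_probability(tokens: list[str]) -> float:
    """Logistic score from the words found in a domain name."""
    z = BIAS + sum(WEIGHTS.get(token, 0.0) for token in tokens)
    return 1.0 / (1.0 + math.exp(-z))


def triage(tokens: list[str]) -> str:
    """Return 'escalate' when the name alone looks suspicious enough to
    justify a costlier check (such as content analysis), else 'allow'."""
    return "escalate" if malicious_probability(tokens) >= THRESHOLD else "allow"


print(triage(["payday", "loans"]))
print(triage(["texas", "golf"]))
```

The name-based score costs almost nothing to compute, so under these assumptions the expensive analysis would run only on the small share of domains that cross the threshold.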