When we began sketching out a system to solve this problem, we encountered issues others have faced: every company or vendor uses their own data formats, a consistent vocabulary is rare, and each threat type can look very different from the next. With that in mind, we set about building what we now call ThreatData, a framework for importing information about badness on the Internet in arbitrary formats, storing it efficiently, and making it accessible for both real-time defensive systems and long-term analysis.
Design, Starting With Feeds
The ThreatData framework consists of three high-level parts: feeds, data storage, and real-time response. Feeds collect data from a specific source and are implemented via a lightweight interface. The data can be in nearly any format and is transformed by the feed into a simple schema we call a ThreatDatum. The datum is capable of storing not only the basics of the threat (e.g., evil-malware-domain.biz) but also the context in which it was bad. The added context is used in other parts of the framework to make more informed, automatic decisions.
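To make the feed idea concrete, here is a minimal sketch of what a feed interface and a ThreatDatum record could look like. It is an illustration only, not the actual ThreatData code: the field names, the `Feed` base class, the `fetch` method, and the CSV input format are all assumptions made for this example.

```python
import csv
import io
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class ThreatDatum:
    """Hypothetical schema: the indicator plus the context it was seen in."""
    indicator: str        # e.g. "evil-malware-domain.biz"
    indicator_type: str   # e.g. "domain", "url", "file_hash"
    threat_type: str      # e.g. "malware", "phishing"
    source: str           # name of the feed that produced the datum
    context: dict = field(default_factory=dict)


class Feed(ABC):
    """Assumed lightweight interface: a feed only has to yield ThreatDatum records."""
    name: str

    @abstractmethod
    def fetch(self) -> Iterator[ThreatDatum]:
        ...


class CsvDomainFeed(Feed):
    """Example feed that parses made-up 'domain,threat_type,note' CSV text."""
    name = "example_csv_blocklist"

    def __init__(self, raw_text: str) -> None:
        self.raw_text = raw_text

    def fetch(self) -> Iterator[ThreatDatum]:
        for row in csv.reader(io.StringIO(self.raw_text)):
            # Skip blank, short, and comment rows.
            if len(row) < 3 or row[0].startswith("#"):
                continue
            domain, threat_type, note = (cell.strip() for cell in row[:3])
            yield ThreatDatum(
                indicator=domain.lower(),
                indicator_type="domain",
                threat_type=threat_type,
                source=self.name,
                context={"note": note},
            )


raw = "# domain,threat,note\nEvil-Malware-Domain.biz,malware,dropper host\n"
data = list(CsvDomainFeed(raw).fetch())
```

Whatever the source format, the rest of the pipeline in this sketch only ever sees `ThreatDatum` objects, which is the property the feed layer is meant to provide.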
Here are some examples of feeds we have implemented:
- Malware file hashes from VirusTotal;
- Malicious URLs from multiple open source blogs and malware tracking sites;
- Vendor-generated threat intelligence we purchase;
- Facebook's internal sources of threat intelligence; and
- Browser extensions for importing data as a Facebook security team member reads an article, blog, or
Once a feed has transformed the raw data, it is fed into two of our existing data repository technologies: Hive and Scuba.
We use Hive storage to answer questions based on long-term data:
- Have we ever seen this threat before?
- What type of threat is more prevalent from our perspective: malware or phishing?
Scuba gives us the opposite end of the analysis spectrum:
- What new malware are we seeing today?
- Where are most of the new phishing sites?
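The two kinds of questions differ mainly in the time window they cover. The sketch below shows that difference using a plain in-memory list. It does not use Hive or Scuba, and the records, timestamps, and function names are invented for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

# Made-up reference time and records: (indicator, threat_type, first_logged).
NOW = datetime(2014, 3, 1, 12, 0)
RECORDS = [
    ("aaa111", "malware", NOW - timedelta(days=200)),
    ("phish.example", "phishing", NOW - timedelta(days=90)),
    ("bbb222", "malware", NOW - timedelta(hours=3)),
    ("ccc333", "malware", NOW - timedelta(hours=1)),
]


def seen_before(indicator):
    """Long-term question: have we ever logged this indicator?"""
    return any(rec[0] == indicator for rec in RECORDS)


def prevalence_by_type():
    """Long-term question: which threat type appears most often overall?"""
    return Counter(rec[1] for rec in RECORDS)


def new_today(threat_type, now=NOW):
    """Short-term question: which indicators of this type arrived in the last day?"""
    cutoff = now - timedelta(days=1)
    return [ind for ind, kind, ts in RECORDS if kind == threat_type and ts >= cutoff]


counts = prevalence_by_type()
recent_malware = new_today("malware")
```

In a real deployment the first two functions would scan the full history and the third would hit a store tuned for recent data; here both read the same list purely to keep the example small.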
Maintaining accurate threat databases is great and can help answer challenging questions, but that's only part of protecting the graph. We also need to quickly and consistently address threats that come to our attention. To help us, we built a processor that examines each ThreatDatum at the time of logging and acts on the threat it describes. Here are some examples we've implemented so far:
- All malicious URLs collected from any feed are sent to the same blacklist used to protect people on facebook.com;
- Interesting malware file hashes are automatically downloaded from known malware repositories, stored, and sent for automated analysis; and
- Threat data is propagated to our homegrown security event management system, which is used to protect Facebook's corporate networks.
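One way to picture a processor like this is as a dispatcher that routes each newly logged datum to handlers registered for its type. The sketch below is an assumption about the shape of such a system, not the real implementation: the `on` decorator, the handler names, and the in-memory blacklist, queue, and event list are placeholders for the real downstream systems. It also queues every file hash, whereas the real processor only acts on hashes it considers interesting.

```python
from collections import defaultdict
from typing import NamedTuple


class ThreatDatum(NamedTuple):
    """Hypothetical, stripped-down datum for this example."""
    indicator: str
    indicator_type: str  # "url", "file_hash", ...


# Handlers registered per indicator type; "*" means "every datum".
_handlers = defaultdict(list)


def on(indicator_type):
    """Register a handler for one indicator type."""
    def register(fn):
        _handlers[indicator_type].append(fn)
        return fn
    return register


# Stand-ins for the downstream systems.
url_blacklist = set()
analysis_queue = []
security_events = []


@on("url")
def block_url(datum):
    url_blacklist.add(datum.indicator)


@on("file_hash")
def queue_for_analysis(datum):
    analysis_queue.append(datum.indicator)


@on("*")
def forward_to_event_system(datum):
    security_events.append(datum)


def process(datum):
    """Run type-specific handlers first, then the catch-all handlers."""
    for handler in _handlers[datum.indicator_type] + _handlers["*"]:
        handler(datum)


process(ThreatDatum("http://evil.example/login", "url"))
process(ThreatDatum("0123456789abcdef0123456789abcdef", "file_hash"))
```

Keeping the handlers independent of the feeds means a new feed automatically benefits from every existing response, and a new response applies to data from every feed.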
Now that we have the ThreatData framework in place, we continue to iterate on it, more Facebook engineers are hacking on it, and we are bringing in new types of threats. Along the way, we've had some interesting discoveries.
Feature Phone Malware
In the summer of 2013, we noticed a spike in malware samples containing the string 'J2ME' in the anti-virus signature. Further investigation revealed a spam campaign using fake Facebook accounts to send links to malware designed for feature phones. The malware, specifically the Trojan:J2ME/Boxer family, was capable of stealing a victim's address book, sending premium SMS spam, and using the phone's camera to take pictures. With this discovery, we were able to analyze the malware, disrupt the spam campaign, and work with partners to disrupt the botnet's infrastructure. Below is a chart of a similar campaign attempted in December 2013.
December 2013 spam campaign attempting to spread Trojan.J2ME.Boxer malware;
Blue is unique URLs, Red is unique binaries
In a typical corporate environment, a single anti-virus product is deployed to all devices and used as a core defense. In reality, however, no single anti-virus product will detect all threats. Some vendors are great at detecting certain types of malware, while others can detect a wide array of threats but are more likely to mislabel them. We decided to use our framework to construct a lightweight set of hashes that our chosen anti-virus product does not detect and to feed those hashes directly into our custom security event management system. The results have been impressive: we've detected both adware and malware installed on visiting vendor computers that no single anti-virus product could have found for us.
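At its core, the hash set described here is a set difference: everything known to be bad, minus what the deployed product already catches. The following sketch assumes both inputs are available as collections of hex digests. The function names and sample values are hypothetical.

```python
def undetected_hashes(known_bad, detected_by_av):
    """Hashes known to be bad that the chosen anti-virus product misses."""
    return {h.lower() for h in known_bad} - {h.lower() for h in detected_by_av}


def match_endpoint(observed, watchlist):
    """Hashes observed on a machine that appear on the watchlist, sorted."""
    return sorted(h for h in {o.lower() for o in observed} if h in watchlist)


# Made-up short digests, for illustration only.
known_bad = {"AAAA", "bbbb", "cccc"}
detected_by_av = {"aaaa"}
watchlist = undetected_hashes(known_bad, detected_by_av)

hits = match_endpoint(["CCCC", "dddd", "cccc"], watchlist)
```

Normalizing case before comparing avoids missing a match because two sources format the same digest differently.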
As part of the ThreatData framework, we have growing capabilities to decorate the data with additional context at logging time. For example, we add Autonomous System, ISP, and country-level geocoding on every malicious or victimized IP address logged to the repository. As a result, we can understand where threats are coming from, arranged by type of attack, time, and frequency. The heat map below covers one month's worth of data decorated with ASN, ISP, and country; each country's shade reflects the combined volume of malicious and victimized IP addresses. The inset pie chart breaks out U.S. IP addresses by ISP. Charts like this, which an analyst can build in under a minute, are used by Facebook's security teams to drive the relationships we build with other companies and daily remediation actions.
World map of malicious and victimized IP addresses, with inset of United States IP addresses broken out by Internet service provider
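Decoration at logging time can be sketched as a lookup that attaches network ownership and location to each IP address before it is stored. The prefix table below is entirely made up and uses address ranges and AS numbers reserved for documentation; a real system would consult routing and geolocation data instead. The `decorate` function and the record layout are assumptions for this example.

```python
import ipaddress
from collections import Counter

# Hypothetical lookup table built from documentation-only prefixes and ASNs.
PREFIX_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"),
     {"asn": 64496, "isp": "Example ISP A", "country": "US"}),
    (ipaddress.ip_network("198.51.100.0/24"),
     {"asn": 64497, "isp": "Example ISP B", "country": "US"}),
    (ipaddress.ip_network("203.0.113.0/24"),
     {"asn": 64498, "isp": "Example ISP C", "country": "BR"}),
]


def decorate(record):
    """Return a copy of the record with ASN, ISP, and country attached."""
    ip = ipaddress.ip_address(record["ip"])
    for network, info in PREFIX_TABLE:
        if ip in network:
            return {**record, **info}
    # Unknown prefix: keep the record, mark the context as missing.
    return {**record, "asn": None, "isp": None, "country": None}


logged = [decorate(r) for r in [
    {"ip": "192.0.2.10", "role": "malicious"},
    {"ip": "198.51.100.7", "role": "victimized"},
    {"ip": "203.0.113.99", "role": "malicious"},
    {"ip": "10.0.0.1", "role": "victimized"},
]]

# Combined volume of malicious and victimized addresses per country.
by_country = Counter(r["country"] for r in logged)
```

Because the context is attached once, when the record is written, later aggregations such as the per-country count need no further lookups.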
Discoveries and detection capabilities like these are just the tip of the iceberg. We're constantly finding new ways to improve and extend the ThreatData framework to encompass new threats and make smarter decisions with the ones we've already identified. We realize that not all aspects of this approach are entirely novel, but we wanted to share what has worked for us to help spark new ideas. We've found that the framework lets us easily incorporate fresh types of data and quickly hook into new and existing internal systems, regardless of their technology stack or how they conceptualize threats.
By Mark Hammell