Thursday, May 30, 2013

Errorists' Empire - Exploring Typo Networks

Have you ever mistyped a web address? Of course you have, and you are not alone. Every day a vast number of web requests are sent to typo addresses: close cousins of major web sites' addresses that differ by only one or two characters. When you consider the amount of traffic that typo-sites can receive, it is easy to understand why these domains are so valuable. For years, "typosquatters" have been reaping the value of bad typing, using inventive means to generate revenue.

However, what is interesting is that a significant number of typo-sites can be observed using identical methods to mislead users into installing a common set of binaries onto PCs (more below).

During this review, I found over 341 significant typo-sites using related tactics, hosting resources, and executables. These sites included numerous typo variations that easily receive traffic intended for major sites like Blogspot, Craigslist, Foxsports, Google, Gmail, Hotmail, Linkedin, and others.

A more comprehensive list of these typo-sites can be found on pastebin for those interested.

Looking at some of these address names, it doesn't take a great deal of imagination to realize that the operators behind this are delivering binaries to a large number of systems daily.  

This leads us to some interesting questions:
  • What are the methods used by this group?
  • What is the monetization process driving these efforts?
  • What parties are involved in this activity?
To start our investigation, let's take a look at what is probably one of the most successful typo sites of all time: GMAI.COM.
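To make the idea of a typo address concrete, here is a minimal sketch (mine, not a tool used by the operators) that lists every single-character-omission variant of a domain. The function name omission_typos is hypothetical, and real typo domains also include swapped letters, doubled letters, and wrong TLDs that this does not cover.

```shell
# Print every variant of a domain that is missing exactly one character
# from the label before the first dot (hypothetical helper).
omission_typos() {
    printf '%s\n' "$1" | awk '{
        dot  = index($0, ".")            # position of the first dot
        name = substr($0, 1, dot - 1)    # label before the dot
        rest = substr($0, dot)           # the dot and everything after it
        for (i = 1; i <= length(name); i++)
            print substr(name, 1, i - 1) substr(name, i + 1) rest
    }' | LC_ALL=C sort -u
}
```

Running omission_typos gmail.com lists five candidates, gmai.com among them.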

Link Analysis (GMAI.COM)
Around 5/30/2013, if you happened to miss the "L" in gmail.com, you'd start your way down a link-path that goes something like this:

 - To start off with, -( Securehost/Bahamas - virustotal) serves us a 302 redirection bounce in the initial HTTP response and we get passed to an interesting web-site "" ( / Netelligent Canada) which provides persistent URL redirection/cloaking service:

   Example of web-redirection service 
 (same services also found on //
- This redirection service hands us a JavaScript redirect to:
The site is on round-robin DNS with the IPs: (Amazon EC2/US - urlquery, virustotal), (Amazon EC2/US -  urlquery, virustotal), and (Securehost/Bahamas . urlquery,virustotal) .
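If you want to walk a chain like this yourself (from an isolated analysis box, please), the HTTP-level hops can be listed from response headers. This is only a sketch: list_redirect_hops is a hypothetical helper, and it only sees Location headers, so the JavaScript redirect described above would not show up and has to be followed by reading the page source.

```shell
# Read HTTP response headers on stdin and print "<status> <Location target>"
# for every response that carries a Location header (hypothetical helper).
list_redirect_hops() {
    tr -d '\r' | awk '
        /^HTTP\//                  { status = $2 }         # remember latest status code
        tolower($1) == "location:" { print status, $2 }    # one line per redirect hop
    '
}
```

A typical use would be piping in the output of curl -sIL for the address being examined (-s silent, -I headers only, -L follow redirects).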

After reaching this final destination, the site gives us a nice warning and notice that we are required to upgrade to "Adobe Flash 11.0" to proceed. 

This page also sports some noteworthy disclaimer language; the basic gist is that you are going to be provided with a "customer installer", which is different from what you were likely expecting.

At this point, pretty much anything we click on here serves us a binary; however, the specific file we receive varies, probably depending on factors like which server we hit and our user agent string.

Since they very much want us to take these executables, let's do that and see what we can learn:

Binary Review
Listed below is information on three binaries that I obtained on different loads of the site: 

Binary Name
Analysis Links
Code - Signed
Air Software
Optimum Installer
“ExtremeFlashPlayer_Ytz Installer.exe”
Denco Ltd.

Though these EXEs are served from different sites, they share a great deal in common:
  • Each is pushed via a Pay-Per-Install Provider framework (more on this below).
  • Only a handful of AV engines detect them; those that do categorize them as adware/spyware or potentially unwanted programs (PUPs).
  • Each of the binaries is code signed (VeriSign class 3).
  • Each makes a common series of registry changes used to significantly downgrade the security setting of Internet Explorer.
Pay-Per-Install Economy
From the binaries we pulled (see above), we can see evidence that these typo sites are being used as a vast "install funnel" for driving a pay-per-install profit chain.

The overall PPI economy has distinctive roles that we can map to what we have seen so far in our review.
While the chain starts with the typo-squatter install funnel, the money actually flows from the "application providers" back up the chain.

The application providers buy access to the PPI provider's affiliate network and the use of its installation framework. In many cases their purchase can be specific to a region and to a target number of installs (e.g. 1000 S. American PCs). Additionally, the PPI provider frequently offers free software that is bundled/bound with the application provider's own code. In turn, the PPI providers issue payment to our typosquatters, who can drive vast amounts of traffic and downloads to catalyze the overall chain.

For the end-user who was tricked into installing the PPI installer software, there really is no good outcome. At best, they end up with some adware or spyware that mucks things up a bit and that they never intended to install in the first place. At worst, they join a botnet: malware authors have for many years leveraged PPI networks as a quick way to build and grow their botnets.

Who Is Involved In This?
It is easy to spot the PPI provider companies whose installers are being used, as well as the many companies whose adware/spyware is piggy-backing on these initial installs. However, it is considerably harder (and therefore more interesting) to look at who may be behind the large number of typo domains that form the install funnel. One reason these companies are harder to find is that their litigation risks are quite significant, due to claims of trademark infringement in addition to formal domain name disputes. Also, a significant measure of their value to the PPI chain is their resilience to disruption or take-down attempts.

To dig a little deeper, we can turn to Maltego  for some help with quick OSINT (report file here).

Looking at the nodes associated with the typo-domains we can find some clear relationships and trends that stand-out:

  1. Hosting Trends - The domains use round-robin DNS between the IPs ( / Amazon EC2 - virustotal / urlquery  and Softlayer - virustotal / urlquery ). Of note, the whois record for one of these IPs contains a rwhois referral which shows PPX International Limited (now YTZ Management) as the organization responsible for the IP.
%rwhois V-1.5:003fff:00 (by Network Solutions, Inc. V-
network:Organization;I:PPX International Limited
network:Street-Address:250 Lytton Blvd
network:Tech-Contact;I:[email protected]
network:Abuse-Contact;I:[email protected]
network:Created:2010-11-19 15:00:01
network:Updated:2011-02-01 20:47:57
network:Updated-By:[email protected]

  2. Registrar Trends - Almost all domains use private registration and as the registrar.

  3. DNS Trends - Most domains are using DomainManager as the authoritative nameservers.
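When you are collecting rwhois referrals like the one in item 1 for a few hundred domains, it helps to pull individual fields out mechanically. Here is a minimal sketch; rwhois_field is a hypothetical helper, and it assumes records shaped like the excerpt above (a "network:" prefix, a key, an optional ";I" type tag, then the value).

```shell
# Print the value of one field from an rwhois record read on stdin.
# Usage: rwhois_field Organization < record.txt   (hypothetical helper)
rwhois_field() {
    awk -v want="$1" '
        /^network:/ {
            line = substr($0, 9)             # drop the "network:" prefix
            sep  = match(line, /[;:]/)       # key ends at the first ";" or ":"
            if (!sep) next
            key   = substr(line, 1, sep - 1)
            value = substr(line, sep + 1)
            sub(/^I:/, "", value)            # strip what is left of a ";I:" type tag
            if (key == want) print value
        }'
}
```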

If you research these three companies, you'll find they are tightly clustered together in terms of their history, investors, and leadership.  These companies are backed by folks who have been in the domain business for many years.  Evidence suggests that they are currently major players whose services are collectively being used to support a platform for aggregating untargeted traffic and focusing it into the PPI pipeline.

Typo-squatters and commercial PPI companies represent themselves as being engaged in legitimate businesses.  It is worth noting that this legitimate business seems to require frequent adoption of techniques used to evade attempts to block or shut down these services (off-shore bullet-proof hosting, URL redirection, binary packing).  Regardless of questions of legality (variance in international laws), the use of misleading tactics to trick users into installing software is far from honorable and can result in serious loss of productivity.  If you've ever had to help a family member clean up after one of these installers pushed adware/spyware to a system, you know just how ugly and frustrating this can be (example below).

More seriously, however, the success of using typo sites to increase the number of systems tied to a PPI network only increases the incentive for cyber-criminals to view these networks as viable delivery vehicles for widespread use.  If a commercial PPI provider can offer you access to hundreds of thousands of systems in affluent regions, then you can potentially go from nothing to a major botnet very quickly.  Research conducted in 2011 (see some of the links below) clearly demonstrated that commercial PPI installers are a common target for infiltration and use for infection.

Further Readings/References

Juan Caballero, Chris Grier, Christian Kreibich, and Vern Paxson 

MIT Technology Review: Most Malware Tied to 'Pay-Per-Install' Market

"Hacker-Howto" PPI  + Malware Redistribution:

Thursday, May 16, 2013

Building Password Dictionaries From Evidence Images

When dealing with a forensic image that contains encrypted files, our best friends are often those ever so helpful post-it notes, weak passwords, or instances of password reuse involving encoding methods that are easily defeated. However, fortune doesn't always favor the forensicator, and periodically you have to look for another shortcut for recovering encrypted content.

One approach that can help with this is to build a password dictionary from printable character strings contained within evidence images. The basic idea is that a user may have stored their password (or a derivation of it) somewhere on the original media or that the password might still be retained on an internal page or swap file.  

A reason to consider this approach is that a dictionary file can be generated and used relatively quickly, whereas a brute-force attack against a decently complex password (more than 6 characters) can potentially take a very long time if you're up against a good cipher.

My initial forays into building case-specific password dictionaries involved the Linux strings command, sed, awk, grep and LOTS of pipes. The overall processing time for this method was rather slow (basically run it and go to bed).  However, using the incredibly versatile bulk_extractor tool by Dr. Simson Garfinkel (available in the latest update of RŌNIN-Linux R1), we can generate a media-specific dictionary file fairly quickly.
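For comparison, the slow pipe-based approach looked roughly like the sketch below. This is a reconstruction from memory rather than my exact original commands, length_filter is a hypothetical helper name, and the 6 and 14 bounds are just chosen to mirror the bulk_extractor defaults mentioned later.

```shell
# Keep only lines whose length is between $1 and $2 characters (inclusive)
# and drop duplicates. Candidate strings are read from stdin.
length_filter() {
    awk -v min="$1" -v max="$2" 'length($0) >= min && length($0) <= max' |
        LC_ALL=C sort -u
}

# Hypothetical usage against an image file (not run here):
#   strings -n 6 evidence1.raw | tr -s '[:space:]' '\n' | length_filter 6 14 > wordlist.txt
```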

If you've never used bulk_extractor before, then I recommend checking out its ForensicsWiki entry. The scope and utility of this tool are much broader than the topic of this post.

Here are some quick steps on building a case dictionary file using bulk_extractor and cracklib.

Using Bulk_Extractor To Build Initial WordList
With the command listed below, we are disabling all other scanners available in bulk_extractor (-E) save for the wordlist scanner, we are writing the generated wordlist to a specific directory (-o), and we are designating the image to be evaluated.  The default settings here will extract words between 6 and 14 characters long, and this is adjustable with the -w flag.
$ bulk_extractor -E wordlist -o /tmp/bulk_extractor/ evidence1.raw

bulk_extractor version: 1.3.1
Hostname: valkyrie
Input file: evidence1.raw
Output directory: /tmp/
Disk Size: 120000000000
Threads: 2

. . .

15:46:32 Offset 119973MB (99.98%) Done in  0:00:00 at 15:46:32
All Data is Read; waiting for threads to finish...
All Threads Finished!
Producer time spent waiting: 0 sec.
Average consumer time spent waiting: 3059 sec.
** bulk_extractor is probably I/O bound. **
**        Run with a faster drive        **
**      to get better performance.       **
Phase 2. Shutting down scanners
Phase 3. Uniquifying and recombining wordlist
Phase 3. Creating Histograms
   ccn histogram...   ccn_track2 histogram...   domain histogram...
   email histogram...   ether histogram...   find histogram...
   ip histogram...   tcp histogram...   telephone histogram...
   url histogram...   url microsoft-live...   url services...
   url facebook-address...   url facebook-id...   url searches...

Elapsed time: 3304 sec.
Overall performance: 34.58 MBytes/sec.

A few cool things to note about bulk_extractor scans "beneath the hood":
  • The scan method employed by bulk_extractor is 100% "agnostic" concerning the actual filesystem contained within the image. We can throw any digital content at it.
  • Bulk_extractor employs parallelization for performance. The data read from the image is split into 16M pages, with one thread per core committed to processing each page.
  • Bulk_extractor is able to pick up where it left off.  If we kill this process and restart, then bulk_extractor will read its last read offset from our output folder and begin there.
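To get a feel for how much work that paging implies, here is a back-of-the-envelope sketch. It assumes "16M" means 16 MiB (16777216 bytes) and ignores any overlap between pages; page_count is a hypothetical helper, not part of bulk_extractor.

```shell
# Number of fixed-size pages needed to cover an image (rounded up).
# $1 = image size in bytes, $2 = page size in bytes (defaults to 16 MiB).
page_count() {
    pc_size="$1"
    pc_page="${2:-16777216}"
    echo $(( (pc_size + pc_page - 1) / pc_page ))
}
```

For the 120000000000-byte image in the run above, page_count 120000000000 prints 7153.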
Converting the Word-List to Dictionary
After the run has completed, we will find a wordlist_split_000.txt file in our output directory.
A quick evaluation of this file shows us that bulk_extractor has extracted 388,950 unique potential password strings.
$ wc -l ./wordlist_split_000.txt 
388950 wordlist_split_000.txt

Obviously the majority of entries contained in our wordlist_split_000.txt file are junk.  If desired, we can clean this dictionary up a bit more as well as obtain some string derivations by using the cracklib utility cracklib-format:
$ cracklib-format wordlist_split_000.txt > wordlist.crack

Cracklib-format performs a few filtering actions here:
  • Lowercases all words
  • Removes control characters
  • Sorts the list
Since decent password cracking tools will employ case variance we often don't lose too much with this clean-up. However, retaining the wordlist_split_000.txt file is a good idea should your password cracking tool not support this.
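If cracklib isn't installed on your analysis box, the three filtering actions listed above can be approximated with standard tools. To be clear, this is my approximation of the effect, not cracklib-format's actual implementation, and normalize_wordlist is a hypothetical name.

```shell
# Approximate the clean-up steps: lowercase, strip control characters
# (keeping newlines), then sort and de-duplicate. Reads stdin, writes stdout.
normalize_wordlist() {
    tr '[:upper:]' '[:lower:]' |
        LC_ALL=C tr -cd '[:print:]\n' |
        LC_ALL=C sort -u
}
```

Usage would be along the lines of normalize_wordlist < wordlist_split_000.txt > wordlist.clean.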

Another option for reducing the password list to an even shorter set is to use cracklib-check to create a list of weak passwords (short, dictionary based).

$ cracklib-check < wordlist.crack | grep -v ': OK$' | cut -d: -f1 > wordlist.weak

Ideas? / Further Reading
Do you have another tool, method, or process that you use for this?  I'd love to hear about it.

Here are a few other links that are useful/relevant:

Tuesday, May 7, 2013

Setting Up A Forensic Hash Server Using Nsrlsvr

When working a case involving media that contains operating system, application, and user data files, it is important to be able to efficiently and reliably differentiate files that warrant examination from those that may be normal system files.  One effective way to do this is to set up a forensic hash server on your analysis network. A forensic hash server centralizes your repository of hash-sets for known files and provides dedicated resources for managing hash queries. Thanks to the great work done by Rob Hansen with support from RedJack Security, we can easily set up our own hash server using his Nsrlsvr project.

Nsrlsvr Overview

Nsrlsvr is a C++ application that can be compiled on Linux or OSX.  It takes its name from the National Software Reference Library (NSRL) project which is maintained by NIST and supported by the Department of Homeland Security and other law enforcement agencies. The NSRL is an extremely large database of  known/valid application files, their file hashes (md5,sha1sum), and associated metadata.  While the NSRL is a wonderful resource, its overall size (over 30 million entries) and flat text format make it unwieldy to run a large number of queries against (trust me - you don't want to grep or findstr against this). That is where Nsrlsvr comes in.   Nsrlsvr loads this data set into memory and makes it easy to perform bulk hash lookups using standard open-source forensic tools (in particular md5deep).

Server Setup: Compiling Nsrlsvr

Compiling Nsrlsvr is not difficult provided you have enough disk space and a few gigs of RAM.   Here are some basic setup instructions for getting this running under RŌNIN-Linux.

Step 1. Download zip of the latest release of Nsrlsvr.

Step 2. Basic Compile
( Note: During the configuration stage, scripts will download the NSRL database and process it; this may take some time depending on your bandwidth and system resources.)

sudo apt-get install build-essential
unzip ./rjhansen-nsrlsvr*.zip
cd ./rjhansen-nsrlsvr*
./configure && make
sudo make install
At the end of the build,  you should have a nsrlsvr binary (/usr/local/bin/nsrlsvr ) as well as a master hash table extracted from the NSRL data-set (/usr/local/share/nsrlsvr/NSRLFile.txt ).

Launching Nsrlsvr 
If you take a look at the man page for Nsrlsvr, you'll see that it is really easy to fire up following installation. To spawn a nsrlsvr daemon that is loaded with the NSRL reference data set, we can simply issue the command (can drop this in rc.local to run on each boot):

( Note: The default TCP port for this process will be 9120. Also, the nsrlsvr daemon consumes a good bit of RAM when loading the NSRL reference data set. The developer recommends 8GB RAM and a 64-bit OS for adequate performance.)

Client Setup: Compiling Nsrllookup
Our analysis systems will also need software installed to be able to issue hash lookup queries to the Nsrlsvr daemon.  To handle this we will install Nsrllookup.

Linux compile instructions are below.  If your analysis systems run Windows, the developer also provides pre-compiled binaries  (32-bit, 64-bit).

Step 1. Download zip of the latest release of Nsrllookup.

Step 2. Basic Compile
sudo apt-get install build-essential
unzip ./rjhansen-nsrllookup*.zip
cd ./rjhansen-nsrllookup*
./configure && make
sudo make install

At the end of this build, you should have a nsrllookup binary (/usr/local/bin/nsrllookup).

Performing Hash Lookups
Now that we have the server up and running and our client has a query tool installed, we can start performing hash lookups.  To do this we will use the md5deep utility to compute the hashes and Nsrllookup to issue queries against our hash server:
md5deep -r ./image_mount_point/|nsrllookup -K known_files.txt -U unknown_files.txt -s hashserverip

With this command, we are using md5deep to perform a recursive scan (-r) of all files contained within our image mount directory. We are piping the returned hash values to Nsrllookup, which in turn queries our central hash server.  The flags (-K, -U) sort queried files into two report files based on whether a file's hash matches an entry in the NSRL reference data set (known) or does not (unknown).  With these two report files, we can focus our review efforts on those entries/objects which were not located in the NSRL database.
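If it helps to see the known/unknown split without a server in the picture, here is an offline sketch of the same idea using a local hash list. It illustrates the classification only, not how Nsrllookup talks to Nsrlsvr; split_known_unknown is a hypothetical helper, and it assumes md5deep's default "hash  filename" output and a hash set file with one MD5 per line.

```shell
# Split md5deep output (stdin) into known and unknown report files.
# $1 = hash set file, $2 = report for known files, $3 = report for unknown files
split_known_unknown() {
    : > "$2"; : > "$3"                        # make sure both reports exist
    awk -v hashfile="$1" -v known="$2" -v unknown="$3" '
        FILENAME == hashfile { set[toupper($1)] = 1; next }   # load the hash set
        (toupper($1) in set) { print > known; next }          # hash is in the set
                             { print > unknown }              # everything else
    ' "$1" -
}
```

Hypothetical usage: md5deep -r ./image_mount_point/ | split_known_unknown hashes.txt known_files.txt unknown_files.txt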

Using Custom HashSets
Nsrlsvr is also capable of loading custom hash sets that you provide.  This is a useful function, as you can launch multiple nsrlsvr processes on different ports, which lets you query against different hash sets.

If you're responsible for DFIR in a corporate or other enterprise computing environment, this function can be really useful for building and loading hashes from desktop and server gold build images.  Another usage idea would be to create a cron job that generates hash files (and/or piecewise hashes) for any malicious files (in-house zoo), illegal images, or other content that you might want to do initial sweeps for early on in an investigation.  To build a custom hash set, we use md5deep and perform some string manipulations to get it into a format that Nsrlsvr will readily parse (see below).

md5deep -r -c /media/goldimage/ | tr '[:lower:]' '[:upper:]' | tr "," " " | awk '{print $1}' > goldimage.hash
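Before loading a custom set, I like to sanity check the file. The sketch below assumes, based on the transformation above, that the target format is one uppercase 32-character hex MD5 per line; I have not confirmed that against the Nsrlsvr documentation, and check_hashset is a hypothetical helper.

```shell
# Print the number of lines that are NOT exactly 32 uppercase hex characters.
# Returns success only when that number is zero.
check_hashset() {
    ch_bad=$(grep -c -v -E '^[0-9A-F]{32}$' "$1" || true)
    echo "$ch_bad"
    [ "$ch_bad" -eq 0 ]
}
```

Running check_hashset goldimage.hash prints 0 when every line has the expected shape.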

We can then fire up and background another nsrlsvr process by doing the following:
nohup nsrlsvr -S -f goldimage.hash -p 7070 2>&1 &
(This binds the new nsrlsvr process to tcp port 7070).

To run a query against this new listener, we can point nsrllookup on our client to this new port (7070) and print known files in our custom hash set (-k for known):
md5deep -r /image_mount_point/|nsrllookup -k -s serverip -p 7070

We can also chain queries using these multiple nsrlsvr listeners.  For example, if you want to list all files whose hash values do not match any entry (-u for unknown) in either the core NSRL data set or your custom data set, you can do something like this:
md5deep -r /image_mount_point/|nsrllookup -u|nsrllookup -u -s serverip -p 7070 

As we can see, Nsrlsvr and Nsrllookup are really useful resources to help with data reduction at the onset of an investigative case as well as for quick review of content that you want to flag.

Thursday, May 2, 2013

Quick DLP Scans With ClamAV

Did you know that ClamAV has a DLP module that can scan for credit card or social security numbers contained in files? One reason this is interesting is that ClamAV is found on almost all Linux security distros (including RŌNIN) and is easily launched from the command line.  If you've ever worked breach cases in data environments covered under PCI-DSS or HIPAA, you know that one of the first questions to answer is: Did personally identifiable information (PII) exist on the compromised system?  To that end, having a quick and readily available DLP scanning tool is a useful capability.

Running DLP Scan Using ClamScan
You can run a DLP scan (and AV sweep) using the ClamAV command line scanner, clamscan, and the following options:

clamscan -r --detect-structured=yes --structured-ssn-format=2 --structured-ssn-count=5 --structured-cc-count=5 directorypath
Command breakdown
-r  (recursive file scanning)
--detect-structured (yes turns on DLP matching. no by default)
--structured-ssn-format=2  (this tells scanner to match both ###-##-#### and #########).
--structured-ssn-count  (minimum number of SSN matches/hits in a file before it is reported)
--structured-cc-count (minimum number of CCN matches/hits in a file before it is reported)
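To make the two SSN layouts selected by --structured-ssn-format=2 concrete, here is a rough stand-in that counts tokens shaped like ###-##-#### or #########. It is only a pattern match of my own, not ClamAV's detection logic, so expect it to disagree with clamscan on edge cases; count_ssn_like is a hypothetical helper.

```shell
# Count tokens on stdin that look like ###-##-#### or #########.
count_ssn_like() {
    # Break the input into runs of digits and hyphens, one per line,
    # then count the runs that match either layout exactly.
    LC_ALL=C tr -c '0-9-' '\n' |
        grep -c -E '^([0-9]{3}-[0-9]{2}-[0-9]{4}|[0-9]{9})$' || true
}
```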

Testing ClamAV DLP Module
To test ClamAV's DLP module, you can use a great DLP test data-set provided by IdentityFinder. This data-set is comprised of a number of files that contain fake ssns, ccns, and other elements of PII distributed across a wide range of common file formats.

If we fire off a scan of this data set using clamscan we get the following results:

clamscan -r --detect-structured=yes --structured-ssn-format=2 --structured-ssn-count=1 --structured-cc-count=1 ./Identity_Finder_Test_Data
./Identity_Finder_Test_Data/Employee Database.accdb: OK
./Identity_Finder_Test_Data/Hidden Column.xls: OK
./Identity_Finder_Test_Data/Department.csv: Heuristics.Structured.SSN FOUND
./Identity_Finder_Test_Data/college essay w footer.doc: OK
./Identity_Finder_Test_Data/Fake SSNs/fake_ssn.txt: Heuristics.Structured.SSN FOUND
./Identity_Finder_Test_Data/Contacts.pptx: Heuristics.Structured.SSN FOUND
./Identity_Finder_Test_Data/loans.xlsx: Heuristics.Structured.SSN FOUND
./Identity_Finder_Test_Data/Samples/SSN.txt: Heuristics.Structured.SSN FOUND
./Identity_Finder_Test_Data/Samples/Sample Real CCN.txt: Heuristics.Structured.CreditCardNumber FOUND
./Identity_Finder_Test_Data/2009 class.docx: Heuristics.Structured.SSN FOUND
./Identity_Finder_Test_Data/Tax Return 2008.pdf: Heuristics.Structured.CreditCardNumber FOUND
./Identity_Finder_Test_Data/Credit Report.pdf: Heuristics.Structured.CreditCardNumber FOUND
./Identity_Finder_Test_Data/Employee Database.mdb: OK
./Identity_Finder_Test_Data/ Heuristics.Structured.CreditCardNumber FOUND
./Identity_Finder_Test_Data/application.pdf: Heuristics.Structured.CreditCardNumber FOUND
./Identity_Finder_Test_Data/students.ppt: Heuristics.Structured.SSN FOUND

From the output we can see that ClamAV found PII in a large number of these files, but not in all of them (which it should have, given the low count thresholds). In particular, the DLP module seems to have a hard time identifying PII contained in Access database files, Excel docs with hidden columns, and Word document footers. As ClamAV's DLP functionality is based on parsing binary streams for matches on structured data (regex), it seems to have issues with formats that do not employ straightforward textual encoding.
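On a larger directory tree, eyeballing output like this gets old fast, so a quick tally per result type helps. This is a small sketch that assumes the "path: Signature FOUND" and "path: OK" line shapes shown above; summarize_dlp_hits is a hypothetical helper name.

```shell
# Tally clamscan result lines (stdin) by signature name, plus a count of OK files.
summarize_dlp_hits() {
    awk '
        / FOUND$/ {
            n = split($0, part, ": ")        # signature is the last ": " field
            sig = part[n]
            sub(/ FOUND$/, "", sig)
            hits[sig]++
            next
        }
        /: OK$/ { hits["OK"]++ }
        END { for (s in hits) print hits[s], s }
    ' | LC_ALL=C sort -k2
}
```

Piping a clamscan run through summarize_dlp_hits prints one "count name" line per signature.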

For a comprehensive DLP sweep, we'd want to look to a  tool like OpenDLP or commercial tools like Identity Finder. However for a quick initial review, ClamAV's DLP scanning features are very good for performing cursory assessments.