Tuesday, May 7, 2013

Setting Up A Forensic Hash Server Using Nsrlsvr

When working a case involving media that contains operating system, application, and user data files, it is important to be able to efficiently and reliably differentiate files that warrant examination from those that are likely normal system files. One effective way to do this is to set up a forensic hash server on your analysis network. A forensic hash server centralizes your repository of hash sets for known files and provides dedicated resources for handling hash queries. Thanks to the great work done by Rob Hansen with support from RedJack Security, we can easily set up our own hash server using his Nsrlsvr project.

Nsrlsvr Overview

Nsrlsvr is a C++ application that can be compiled on Linux or OS X. It takes its name from the National Software Reference Library (NSRL) project, which is maintained by NIST and supported by the Department of Homeland Security and other law enforcement agencies. The NSRL is an extremely large database of known/valid application files, their file hashes (MD5, SHA-1), and associated metadata. While the NSRL is a wonderful resource, its overall size (over 30 million entries) and flat text format make it unwieldy to run a large number of queries against (trust me - you don't want to grep or findstr against this). That is where Nsrlsvr comes in. Nsrlsvr loads this data set into memory and makes it easy to perform bulk hash lookups using standard open-source forensic tools (in particular md5deep).

Server Setup: Compiling Nsrlsvr

Compiling Nsrlsvr is not difficult provided you have enough disk space and a few gigs of RAM.   Here are some basic setup instructions for getting this running under RŌNIN-Linux.

Step 1. Download the zip of the latest release of Nsrlsvr.

Step 2. Basic Compile
(Note: During the configuration stage, scripts will download the NSRL database and process it; this may take some time depending on your bandwidth and system resources.)

sudo apt-get install build-essential
unzip ./rjhansen-nsrlsvr*.zip
cd ./rjhansen-nsrlsvr*
./configure && make
sudo make install
At the end of the build, you should have an nsrlsvr binary (/usr/local/bin/nsrlsvr) as well as a master hash table extracted from the NSRL data set (/usr/local/share/nsrlsvr/NSRLFile.txt).
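
As a quick sanity check after the build, something like the following sketch can confirm that both files are in place. The function name check_nsrlsvr_install is hypothetical; the default paths are the two mentioned above, and either can be overridden if your install prefix differs.

```shell
#!/bin/sh
# check_nsrlsvr_install [BINARY] [DATA_FILE]
# Hypothetical helper: verifies that the nsrlsvr binary is executable and
# that the extracted hash table exists and is not empty.
check_nsrlsvr_install() {
    bin=${1:-/usr/local/bin/nsrlsvr}
    data=${2:-/usr/local/share/nsrlsvr/NSRLFile.txt}
    status=0
    if [ ! -x "$bin" ]; then
        echo "missing or not executable: $bin"
        status=1
    fi
    if [ ! -s "$data" ]; then
        echo "missing or empty: $data"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "install looks complete"
    fi
    return "$status"
}
```

Calling check_nsrlsvr_install with no arguments checks the default locations; a non-zero exit status means at least one of the two files is missing.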

Launching Nsrlsvr 
If you take a look at the man page for Nsrlsvr, you'll see that it is really easy to fire up following installation. To spawn an nsrlsvr daemon that is loaded with the NSRL reference data set, we can simply issue the command (can drop this in rc.local to run on each boot):

(Note: The default TCP port for this process will be 9120. Also, the nsrlsvr daemon consumes a good bit of RAM when loading the NSRL reference data set. The developer recommends 8GB of RAM and a 64-bit OS for adequate performance.)

Client Setup: Compiling Nsrllookup
Our analysis systems will also need software installed to be able to issue hash lookup queries to the Nsrlsvr daemon. To handle this we will install Nsrllookup.

Linux compile instructions are below.  If your analysis systems run Windows, the developer also provides pre-compiled binaries  (32-bit, 64-bit).

Step 1. Download the zip of the latest release of Nsrllookup.

Step 2. Basic Compile
sudo apt-get install build-essential
unzip ./rjhansen-nsrllookup*.zip
cd ./rjhansen-nsrllookup*
./configure && make
sudo make install

At the end of this build, you should have an nsrllookup binary (/usr/local/bin/nsrllookup).

Performing Hash Lookups
Now that we have the server up and running and our client has a query tool installed, we can start performing hash lookups.  To do this we will use the md5deep utility to compute the hashes and Nsrllookup to issue queries against our hash server:
md5deep -r ./image_mount_point/|nsrllookup -K known_files.txt -U unknown_files.txt -s hashserverip

With this command, we are using md5deep to perform a recursive scan (-r) of all files contained within our image mount directory. We are piping the returned hash values to Nsrllookup, which in turn queries our central hash server. The flags (-K, -U) sort the queried files into two reports based on whether each file matches an entry in the NSRL reference data set (known) or does not (unknown). With these two report files, we can focus our review on the entries that were not located in the NSRL database.
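
To get a rough feel for how much the lookup reduced the review set, the two report files can be counted. This is a minimal sketch that assumes each report holds one entry per line (an assumption about the report layout, not something confirmed here); the function name reduction_summary is hypothetical.

```shell
#!/bin/sh
# reduction_summary KNOWN_FILE UNKNOWN_FILE
# Hypothetical helper: counts the lines in each report and prints the share
# of entries that still need review. Assumes one entry per line.
reduction_summary() {
    known=$(wc -l < "$1" | tr -d ' ')
    unknown=$(wc -l < "$2" | tr -d ' ')
    total=$((known + unknown))
    if [ "$total" -eq 0 ]; then
        echo "known=0 unknown=0 review=0%"
        return 0
    fi
    # Integer percentage of entries that were not found in the reference set.
    echo "known=$known unknown=$unknown review=$((unknown * 100 / total))%"
}
```

For example, reduction_summary known_files.txt unknown_files.txt prints the two counts and the percentage of scanned files left to examine.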

Using Custom HashSets
Nsrlsvr is also capable of loading custom hash sets that you provide. This is a useful function, as you can launch multiple nsrlsvr processes on different ports, each serving a different hash set.

If you're responsible for DFIR in a corporate or other enterprise computing environment, this function can be really useful for building and loading hashes from desktop and server gold build images. Another usage idea would be to create a cron job that generates hash files (and/or piecewise hashes) for any malicious files (in-house zoo), illegal images, or other content that you might want to sweep for early on in an investigation. To build a custom hash set, we use md5deep and perform some string manipulations to get it into a format that Nsrlsvr will readily parse (see below).

md5deep -r -c /media/goldimage/ | tr '[:lower:]' '[:upper:]' | tr "," " " | awk '{print $1}' > goldimage.hash

We can then fire up and background another Nsrlsvr process by doing the following:
nohup nsrlsvr -S -f goldimage.hash -p 7070 2>&1 &
(This binds the new nsrlsvr process to tcp port 7070).
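
Before loading a custom hash file, it can be worth filtering it so that only well-formed entries remain. The sketch below assumes the target format is one uppercase 32-character MD5 hex string per line, which is what the md5deep pipeline above produces; the function name normalize_hashes is hypothetical, and the de-duplication step is an addition of this sketch rather than a documented requirement of nsrlsvr.

```shell
#!/bin/sh
# normalize_hashes
# Hypothetical helper: reads md5deep-style lines on stdin ("hash  path" or
# "hash,path"), keeps the first field, uppercases it, drops anything that is
# not a 32-character hex string, and removes duplicates.
normalize_hashes() {
    awk '{
        split($0, parts, /[ ,\t]/)
        h = toupper(parts[1])
        if (length(h) == 32 && h ~ /^[0-9A-F]+$/) print h
    }' | sort -u
}
```

Usage would look like: md5deep -r /media/goldimage/ | normalize_hashes > goldimage.hash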

To run a query against this new listener, we can point nsrllookup on our client to the new port (7070) and print the files that are known in our custom hash set (-k for known):
md5deep -r /image_mount_point/|nsrllookup -k -s serverip -p 7070

We can also chain queries across multiple nsrlsvr listeners. For example, if you want to list all files whose hash values do not match any entry (-u for unknown) in either the core NSRL data set or your custom data set, you can do something like this:
md5deep -r /image_mount_point/|nsrllookup -u|nsrllookup -u -s serverip -p 7070 
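
The chaining logic can be tried out offline with a local stand-in. The sketch below is not nsrllookup: unknown_in is a hypothetical filter that checks hashes against a plain hash file instead of a server, and it assumes each stage passes unmatched lines through unchanged so the next stage can read them.

```shell
#!/bin/sh
# unknown_in HASH_FILE
# Hypothetical local stand-in for an "unknown" filter: reads "hash  path"
# lines on stdin and prints only those whose hash is NOT listed in HASH_FILE
# (one hash per line, compared case-insensitively).
unknown_in() {
    awk -v set="$1" '
        BEGIN {
            while ((getline line < set) > 0)
                if (split(line, f, " ") > 0) seen[toupper(f[1])] = 1
        }
        !(toupper($1) in seen)
    '
}
```

Piping through two filters, for example md5deep -r /image_mount_point/ | unknown_in nsrl.hash | unknown_in goldimage.hash, leaves only the lines that neither set recognizes, which mirrors the two-listener chain above.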

As we can see, Nsrlsvr and Nsrllookup are really useful resources to help with data reduction at the onset of an investigative case as well as for quick review of content that you want to flag.


  1. Hi -- this is Rob, the guy behind nsrlsvr and nsrllookup. First, thanks a lot for this review. It's always nice to see people are finding it useful!

    To give a couple of technical details about how nsrlsvr works -- it reads in 30 million hashes from disk as ASCII strings and stores them in a balanced tree for rapid lookups. This means first off that it requires a minimum of a few hundred megabytes of memory to run, but it will be scattered all throughout the heap. On 4Gb systems this can create severe memory fragmentation issues. It'll run, just ... less well than I'd like. Give it 8Gb and it should hum quite prettily.

    What you get for this in-memory structure, though, is *volume*. You'll saturate your network connection long before the server stops being responsive. Since it's entirely memory-resident it's quite snappy. You can set up a single nsrlsvr instance and have it used throughout your entire organization.

    This makes it possible to give one person the "nsrladmin hat," and make them responsible for updating hash values in a timely fashion. That's far better than a dozen investigators all setting up their own nsrlsvr instances and six months later no two are using the exact same dataset.

    Anyway -- thank you again for the review, and I hope it continues to be useful to you. If you have any questions, please feel free to drop a note my way. :)

  2. Have you explored using a single-file database such as CDB (http://en.wikipedia.org/wiki/Cdb_(software) or http://cr.yp.to/cdb.html)? It is open source and it will remove the memory requirements and limitations you are hitting and still provide extremely fast lookups. In addition, if you build your CDB database using md5 hashes as keys, you can cut the size of the md5 sum in half from a 32 byte hex string to a 16 byte integer.

    I have worked with a project where we had 35 million known good hashes in a 1.4GB CDB file. We achieved query times in the milliseconds range on a lower end laptop with only 2GB of RAM. Because of the way CDB works you do not have to load the hash data into a glob in RAM.

    Another useful source for hashes is Hashsets.com, they have an extensive list of what you may consider known good to use in addition to the NSRL.

    1. Thanks for your comments. I'm a fan of DJB's code, but haven't come across CDB before. I'll check it out. From your experiences, it sounds very impressive performance wise.

      Also, thanks for the pointer to Hashsets.com (also new to me). A good listing of free and commercially available hash-sets would be really useful.

      It appears PassMark also provides some hash-sets that might have utility (Common Keyloggers is nice):

  3. Unfortunately, CDB, SQLite, BDB and the like really aren't in the cards for a lot of reasons.

    1. No user demand for them. Requiring an 8Gb server and a 64-bit OS isn't all that unreasonable. My desktop development machine has 32Gb, for instance.

    2. Third-party components make it harder for users to fully vet the nsrlsvr/nsrllookup system. As it currently stands they're each about 1000 lines of C++ and can be fully audited over a weekend.

    3. Performance. Assuming you're checking a million hashes, you have to send 32Mb of hash data to the server, the server has to process it, and the server then sends 1Mb of lookup data back. With gigabit ethernet over a LAN the transmission time is effectively nil. If each hash lookup is a millisecond and requires a disk access, that means in the best-case scenario you're looking at 15 minutes for nsrlsvr to complete the lookups and get the data back to you. In the worst-case scenario someone's already doing a large lookup and you've got disk I/O contention going on. If the entire structure is kept in memory lookup times are sub-microsecond, which means the million hashes get processed in about 1 second. Further, there is no worst-case scenario since a C++ std::set is safe for simultaneous reads from multiple threads -- there's no analogue to disk I/O contention.

    If someone comes to me with a serious need for CDB functionality I'll consider introducing it, but for right now I think it would be a premature optimization that would harm performance and not give us very much. But I certainly agree that CDB, SQLite, etc., are neat tools and there are tons of areas where they can be used productively.
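
The back-of-the-envelope numbers in point 3 above can be reproduced with a small calculation. This sketch assumes 32 characters per MD5 hex string and a fixed cost per lookup, and ignores network and protocol overhead; the function name lookup_estimate is hypothetical.

```shell
#!/bin/sh
# lookup_estimate HASH_COUNT SECONDS_PER_LOOKUP
# Hypothetical helper: prints the request size (32 bytes per hex MD5) and the
# total lookup time for a batch, assuming lookups run one after another.
lookup_estimate() {
    awk -v n="$1" -v per="$2" 'BEGIN {
        printf "request=%.0f MB total=%.1f s (%.1f min)\n", n * 32 / 1000000, n * per, n * per / 60
    }'
}
```

With one million hashes, lookup_estimate 1000000 0.001 gives about 1000 seconds (roughly 17 minutes, the same order as the 15 minutes quoted) for millisecond disk-backed lookups, while lookup_estimate 1000000 0.000001 gives about 1 second for microsecond in-memory lookups.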

  4. Hi Rob ... Thanks for this great utility..

    Can you please tell me if these hash look-up queries can be made programmatically as well ? If yes, then will they be made in a similar way like we do for querying any normal database present on a server? What I feel is that it must require some library to be imported first because if we will normally query this database, then those performance speed-ups won't be there that we get when we issue commands via Terminal in Linux..

    I hope I am clear in my question to you..