Thursday, May 16, 2013

Building Password Dictionaries From Evidence Images

When dealing with a forensic image that contains encrypted files, our best friends are often those ever-so-helpful Post-it notes, weak passwords, or instances of password reuse involving encoding methods that are easily defeated. However, fortune doesn't always favor the forensicator, and periodically you have to look for another shortcut for recovering encrypted content.

One approach that can help here is to build a password dictionary from the printable character strings contained within the evidence image. The basic idea is that a user may have stored their password (or a derivation of it) somewhere on the original media, or that the password might still be retained in a page or swap file.

A reason to consider this approach is that a dictionary file can be generated and used relatively quickly, whereas a brute-force attack against a decently complex password of more than 6 characters can take a very long time if you're up against a good cipher.

My initial forays into building case-specific password dictionaries involved the Linux strings command, sed, awk, grep and LOTS of pipes. The overall processing time for this method was rather slow (basically run it and go to bed). However, using the incredibly versatile bulk_extractor tool by Dr. Simson Garfinkel (available in the latest update of RŌNIN-Linux R1), we can generate a media-specific dictionary file fairly quickly.
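
For comparison, the older pipe-based approach can be sketched in a few lines of shell. This is a minimal stand-in and not the original pipeline: the extract_words helper name is hypothetical, it uses tr instead of strings to split the image on non-printable bytes and whitespace, and the 6 to 14 character window is simply borrowed from the bulk_extractor wordlist defaults.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): extract_words IMAGE [MIN] [MAX]
# Prints unique printable tokens whose length falls between MIN and MAX.
extract_words() {
    image=$1
    min=${2:-6}     # assumed default, mirrors the bulk_extractor wordlist minimum
    max=${3:-14}    # assumed default, mirrors the bulk_extractor wordlist maximum

    # Turn every run of non-graphic bytes (control bytes, whitespace, high bytes)
    # into a single newline, keep tokens in the length window, then de-duplicate.
    LC_ALL=C tr -cs '[:graph:]' '\n' < "$image" |
        awk -v min="$min" -v max="$max" 'length($0) >= min && length($0) <= max' |
        LC_ALL=C sort -u
}

# Example use against an image file:
#   extract_words evidence1.raw > strings_wordlist.txt
```

On a large image this single-threaded pipeline is the "run it and go to bed" kind of job, which is the main reason to prefer bulk_extractor.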

If you've never used bulk_extractor before, then I recommend checking out its ForensicsWiki entry. The scope and utility of this tool are much broader than the topic of this post.

Here are some quick steps on building a case dictionary file using bulk_extractor and cracklib.

Using Bulk_Extractor To Build Initial WordList
With the command listed below, we are disabling all other scanners available in bulk_extractor save for the wordlist scanner (-E), we are writing the generated wordlist to a specific directory (-o), and we are designating the image to be evaluated. The default settings here will extract words between 6 and 14 characters long. This range is adjustable; the 1.3 usage text lists it as -W min:max, while -w names a stop list, so check bulk_extractor -h for your version.
$ bulk_extractor -E wordlist -o /tmp/bulk_extractor/ evidence1.raw

bulk_extractor version: 1.3.1
Hostname: valkyrie
Input file: evidence1.raw
Output directory: /tmp/
Disk Size: 120000000000
Threads: 2

. . .

15:46:32 Offset 119973MB (99.98%) Done in  0:00:00 at 15:46:32
All Data is Read; waiting for threads to finish...
All Threads Finished!
Producer time spent waiting: 0 sec.
Average consumer time spent waiting: 3059 sec.
** bulk_extractor is probably I/O bound. **
**        Run with a faster drive        **
**      to get better performance.       **
Phase 2. Shutting down scanners
Phase 3. Uniquifying and recombining wordlist
Phase 3. Creating Histograms
   ccn histogram...   ccn_track2 histogram...   domain histogram...
   email histogram...   ether histogram...   find histogram...
   ip histogram...   tcp histogram...   telephone histogram...
   url histogram...   url microsoft-live...   url services...
   url facebook-address...   url facebook-id...   url searches...

Elapsed time: 3304 sec.
Overall performance: 34.58 MBytes/sec.

A few cool things to note about how bulk_extractor scans "beneath the hood":
  • The scan method employed by bulk_extractor is 100% "agnostic" concerning the actual filesystem contained within the image. We can throw any digital content at it.
  • Bulk_extractor employs parallelization for performance. The data read from the image is split into 16 MiB pages, and one thread per core is committed to processing those pages.
  • Bulk_extractor is able to pick up where it left off. If we kill this process and restart it, then bulk_extractor will read its last processed offset from our output folder and begin there.
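
To get a feel for how much work the threads share, the page count for an image can be estimated with a little shell arithmetic. This is a rough sketch: the page_count helper is hypothetical, it assumes the 16 MiB (16777216 byte) page size mentioned above, and it ignores the overlap margin that bulk_extractor reads past each page boundary.

```shell
#!/bin/sh
# Hypothetical helper: page_count IMAGE_SIZE_BYTES [PAGE_SIZE_BYTES]
# Prints how many pages of PAGE_SIZE_BYTES are needed to cover the image.
page_count() {
    size=$1
    page=${2:-16777216}   # assumed page size: 16 MiB

    # Integer ceiling division: a final partial page still counts as a page.
    echo $(( (size + page - 1) / page ))
}

# Example use with the disk size reported in the run above:
#   page_count 120000000000
```

For the 120000000000 byte image in this run that works out to 7153 pages, spread across the 2 threads reported in the output.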
Converting the Word-List to Dictionary
After the run has completed, we will find a wordlist_split_000.txt file in our output directory.
A quick evaluation of this file shows us that bulk_extractor has extracted 388,950 unique potential password strings.
$ wc -l ./wordlist_split_000.txt 
388950 wordlist_split_000.txt
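
Before cleaning the list, it can help to see how the candidates are distributed by length. The sketch below is only an illustration: the length_histogram helper name is hypothetical, and it assumes the wordlist file holds one candidate per line with no header lines.

```shell
#!/bin/sh
# Hypothetical helper: length_histogram WORDLIST
# Prints "LENGTH COUNT" pairs, one per line, sorted by length.
length_histogram() {
    awk '{ n[length($0)]++ }
         END { for (len in n) print len, n[len] }' "$1" |
        sort -n
}

# Example use:
#   length_histogram wordlist_split_000.txt
```

If most of the junk sits at one end of the range, that is a hint for narrowing the word length window on a later run.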

Obviously, the majority of entries contained in our wordlist_split_000.txt file are junk. If desired, we can clean this dictionary up a bit more, and normalize the strings, by using the cracklib utility cracklib-format:
$ cracklib-format wordlist_split_000.txt > wordlist.crack

Cracklib-format performs a few filtering actions here:
  • Lowercases all words
  • Removes control characters
  • Sorts the list
Since decent password cracking tools will employ case variance, we often don't lose too much with this clean-up. However, retaining the wordlist_split_000.txt file is a good idea should your password cracking tool not support this.
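
If your cracking tool does not handle case variance, one option is a similar clean-up that keeps the original case. The following is a minimal sketch of that idea and not a copy of what cracklib-format does internally: the clean_keep_case helper name is hypothetical, and dropping every byte outside the printable ASCII range is an assumption made to keep the example simple.

```shell
#!/bin/sh
# Hypothetical helper: clean_keep_case WORDLIST
# Strips non-graphic bytes, drops empty lines, and de-duplicates
# without lowercasing anything.
clean_keep_case() {
    # Delete every byte that is neither a newline nor a graphic ASCII character.
    LC_ALL=C tr -cd '\n[:graph:]' < "$1" |
        awk 'length($0) > 0' |
        LC_ALL=C sort -u
}

# Example use:
#   clean_keep_case wordlist_split_000.txt > wordlist.case
```

The result can sit alongside wordlist.crack so that both the lowercased and the case-preserving lists are available.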

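Another option for reducing the password list to an even shorter set is to use cracklib-check to create a list of weak passwords (short or dictionary based). Cracklib-check prints each candidate followed by a colon and either OK or the reason it was rejected, so we drop the OK lines and keep only the candidate column:

$ cracklib-check < wordlist.crack | grep -v ': OK$' | cut -d: -f1 > wordlist.weak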
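
To see why cracklib-check rejected candidates, its output can be tallied by reason. This is a hedged sketch: the reason_tally helper name is hypothetical, and it assumes each input line has the form "candidate: verdict", where the verdict is either OK or a short reason.

```shell
#!/bin/sh
# Hypothetical helper: reason_tally
# Reads cracklib-check style lines ("candidate: verdict") on stdin and
# prints "COUNT<TAB>VERDICT", most frequent verdict first.
reason_tally() {
    # Split on ": " and count the last field, which is the verdict text.
    awk -F': ' '{ n[$NF]++ }
                END { for (r in n) print n[r] "\t" r }' |
        sort -rn
}

# Example use:
#   cracklib-check < wordlist.crack | reason_tally
```

A tally dominated by dictionary-based rejections suggests the weak list is mostly real words, while a tally dominated by OK suggests most of the input is still gibberish.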

Ideas? / Further Reading
Do you have another tool, method, or process that you use for this? I'd love to hear about it.

Here are a few other links that are useful/relevant:


  1. Making forensics interesting. Well done sir.

  2. This comment has been removed by a blog administrator.

  3. Great article. I'm new to bulk extractor and going through the wordlist. When I ran it, the list came up with more gibberish than any discernible words. Is there a way to compare the bulk extractor wordlist against the English language dictionary (or some other method) to get a list of actual words rather than gibberish?

  4. I'm a new bulk extractor user. Is there a method to get just a list of words rather than a gibberish list of everything? I'm getting so much junk that it is nearly unusable.