Max Duijsens
About computer security and other hobbies

VirusTotal Lookups in Splunk

As Technical Lead of a honeypot project, I collect the logs from our honeypots, which are forwarded to our Splunk setup. Those logs include the md5 hashes of all malware caught by the honeypots. While gradually switching to a Modern Honey Network setup, I noticed that its Splunk implementation adds a link to each event that contains an md5 hash. This link points to the VirusTotal (VT) scan results. However, it would be much more interesting to have the actual VT scan results directly in Splunk, so you can search for new or ‘clean’ malware. In this post I explain how to add scan results from VirusTotal to your Splunk events. Note that this solution doesn’t scale well: if you are running a large enterprise Splunk, you might want to consider getting a VirusTotal API key that allows more requests per minute than the free one, which allows up to 4 requests in any given 60-second window.

Splunk has a feature called lookup tables. I use this feature in combination with a Python script to look up md5 hashes on VirusTotal. This is not difficult, but there is one major problem: the free VirusTotal API only allows 4 requests per minute. Grabbing the scan results live (while the events are displayed) is therefore not an option. Lookup tables to the rescue: a script fills the table in the background, and Splunk reads the results from that table at search time.
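To make the quota concrete, here is a minimal sketch of how a batch script could pace itself to stay under 4 requests in any 60-second window. It is not taken from the script used in this post; the class name SlidingWindowLimiter and its clock and sleep parameters are hypothetical and exist only for illustration.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `window` seconds."""

    def __init__(self, max_calls=4, window=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.window = window
        self.clock = clock      # injectable so the limiter can be tested
        self.sleep = sleep
        self.calls = deque()    # timestamps of the most recent calls

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        # Forget calls that have already left the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.window - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)
```

Calling wait() on one shared limiter before every VirusTotal request would keep a batch job within the free quota, at the cost of the job taking roughly 15 seconds per hash on average.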

We start by making a Python script which does the actual lookup. I took a script by Didier Stevens. His script takes as input a txt file with a list of md5 hashes and runs a search query on VirusTotal to grab the scan results, which are stored as a csv. His script separates the fields of the output csv file with semicolons (;), which is the European way of formatting csv files. However, Splunk does not accept semicolons and expects commas as field separators, so I changed this part of his script. Commas introduce a new problem, since the scan results from VirusTotal also contain commas. Therefore quotes were inserted around the scan results field, so Splunk understands that this field is a single string. Another change was to append to the output csv file (without headers) instead of overwriting it on each run. Because the script only appends, the lookup table will only grow and md5 results will never be deleted.
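The csv changes described above can be sketched with Python's csv module, which uses commas by default and quotes any field that itself contains a comma. This is an illustration, not the modified script: the function name append_results is hypothetical, the column names are the ones the lookup definition later in this post refers to, and writing the header row once for a new file is my assumption (Splunk reads the field names from the first row of the csv).

```python
import csv
import os

# Column names as referenced by the Splunk lookup definition in this post.
FIELDS = ["md5", "Response", "Scan_Date", "Detections",
          "Total", "Permalink", "AVs"]


def append_results(path, rows):
    """Append result rows to a comma-separated lookup file.

    The header is written only when the file is new or empty, so
    repeated runs keep growing the same table.
    """
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        # QUOTE_MINIMAL wraps a field in quotes only when it contains
        # a comma, a quote character or a line break.
        writer = csv.writer(handle, quoting=csv.QUOTE_MINIMAL)
        if is_new:
            writer.writerow(FIELDS)
        writer.writerows(rows)
```

Letting the csv module do the quoting also takes care of scan results that contain quote characters themselves, which hand-inserted quotes would not.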

To get the input list of md5 hashes, we can run a (scripted) Splunk query. With Splunk installed in /opt/splunk:

/opt/splunk/bin/splunk search 'source="*mhn-splunk.log*" AND md5 != None | dedup md5 | table md5' -minutesago 30 -output csv > md5hashes_30m.txt

This outputs a single list of the md5 hashes Splunk found in the last 30 minutes. We then run the VirusTotal checker on this list to search for all these md5 hashes and write the results to /opt/splunk/etc/system/lookups/vtlookup.csv (which is where the global lookup tables should be stored).
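Before handing such a list to a checker, it can help to keep only the lines that really are md5 hashes, in case the export contains a header row, quoted values or empty lines. A minimal sketch; the function name read_hashes is hypothetical, and the exact shape of the Splunk csv export is not assumed here, which is why the filter is deliberately tolerant.

```python
import re

# An md5 hash is exactly 32 hexadecimal characters.
MD5_RE = re.compile(r"[0-9a-fA-F]{32}")


def read_hashes(lines):
    """Return the unique md5 hashes found in an iterable of text lines."""
    hashes = []
    seen = set()
    for line in lines:
        # Tolerate surrounding whitespace and csv-style quotes.
        candidate = line.strip().strip('"')
        if MD5_RE.fullmatch(candidate) and candidate not in seen:
            seen.add(candidate)
            hashes.append(candidate)
    return hashes
```

For example, read_hashes(open("md5hashes_30m.txt")) would return the hashes in file order and silently skip anything else.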

We need one more modification to the vtchecker script, since it simply looks up every md5 hash listed in the input txt file. That means it would look up many duplicates (many hashes get caught by multiple honeypots in different timeframes), so we need to add a history-keeping feature. This was achieved by logging all previously seen md5 hashes in a file and filtering against it (around line 199):

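# Remove hashes that a previous run already looked up.
# Assumes md5seen.txt exists; create an empty file before the first run.
with open("md5seen.txt", "r") as seenfile:
    for line in seenfile:
        line = line.rstrip()
        if line in searchTerms:
            searchTerms.remove(line)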

This snippet removes all previously seen md5 hashes from the searchTerms list. Then, around line 209, each md5 hash is written to md5seen.txt right after it has been checked (the file has to be open for appending at that point): seenfile.write(searchTerms[0] + "\n")
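Outside the context of the original script, the same history-keeping idea can be sketched as three small functions. The function names are hypothetical and this is not the code from the modified script; unlike the snippet above it also copes with a missing md5seen.txt and with a hash that appears more than once in the input.

```python
import os


def load_seen(path):
    """Return the set of hashes recorded by earlier runs (empty if no file yet)."""
    if not os.path.exists(path):
        return set()
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def unseen(hashes, seen):
    """Keep the first occurrence of each hash that was not looked up before."""
    known = set(seen)
    result = []
    for md5 in hashes:
        if md5 not in known:
            known.add(md5)
            result.append(md5)
    return result


def mark_seen(path, md5):
    """Record a hash right after its lookup, so an interrupted run loses nothing."""
    with open(path, "a") as handle:
        handle.write(md5 + "\n")
```

Using a set for the history keeps the filtering fast even when md5seen.txt has grown to many thousands of entries.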

Now that we can build the lookup table, we have to tell Splunk when to look up data from this table and how to add it to the events as they are displayed in a query. This is done via props.conf and transforms.conf. The file transforms.conf gives the lookup a name and points it to the csv file. In props.conf the actual lookup is defined:

/opt/splunk/etc/system/local/transforms.conf:

[vtlookup]
filename = vtlookup.csv

/opt/splunk/etc/system/local/props.conf:

[mhn-splunk-2]
LOOKUP-vtlookup = vtlookup md5 OUTPUTNEW Response as vt_found, Scan_Date as vt_date, Detections as vt_detections, Total as vt_total, Permalink as vt_url, AVs as vt_details

This configuration is pretty straightforward and looks a lot like a SQL query. The [mhn-splunk-2] stanza tells Splunk which events the lookup applies to: in this case only events with the sourcetype mhn-splunk-2 get the VT data augmentation. The lookup matches on the md5 field, and the OUTPUTNEW keyword adds the listed fields to each matching event without overwriting fields the event already has. The added fields all start with vt_ and give details on the scan result for the md5 hash.
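If the lookup semantics feel abstract, the following Python sketch models what the props.conf line does to a single event. It is only a model of the behaviour, not how Splunk implements lookups; the function names are hypothetical, and keeping a single row per hash is a simplification on my part.

```python
import csv
import io

# Lookup column -> event field, as in the props.conf stanza above.
OUTPUT_MAP = {
    "Response": "vt_found",
    "Scan_Date": "vt_date",
    "Detections": "vt_detections",
    "Total": "vt_total",
    "Permalink": "vt_url",
    "AVs": "vt_details",
}


def load_lookup(handle):
    """Index the lookup rows by md5 (simplification: the first row wins)."""
    table = {}
    for row in csv.DictReader(handle):
        table.setdefault(row["md5"], row)
    return table


def enrich(event, table):
    """Return a copy of the event with vt_ fields added, OUTPUTNEW-style."""
    enriched = dict(event)
    row = table.get(event.get("md5"))
    if row is not None:
        for column, field in OUTPUT_MAP.items():
            # OUTPUTNEW never overwrites a field the event already has.
            enriched.setdefault(field, row[column])
    return enriched


# Example with a made-up lookup row (not real VirusTotal data).
sample_csv = io.StringIO(
    "md5,Response,Scan_Date,Detections,Total,Permalink,AVs\n"
    'abc,1,2015-01-01,0,55,http://example.invalid/abc,"A: x, B: y"\n'
)
lookup_table = load_lookup(sample_csv)
example_event = enrich({"md5": "abc", "source": "honeypot"}, lookup_table)
```

An event whose md5 is not in the table comes back unchanged, which is why hashes that have not been looked up yet show no vt_ fields at all.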

Once all this is defined, schedule the md5 lookup script with cron and have it run at least once, so the lookup table is created and contains some md5 results. Then you can build queries on the vt_ fields, like the following one, which searches for md5 hashes that are either not found on VT or have 0 detections:

sourcetype=mhn-splunk-2 vt_found=0 OR vt_detections=0 | dedup md5 | table md5, vt_total, vt_detections

In conclusion, we now have a convenient way to query for malware that is unknown to VirusTotal or marked clean by it. We can use this query to select the binaries behind those md5 hashes for further analysis, which for us is a nice way to do data reduction. We no longer need to analyze all binaries, just the ones that come up clean or unknown on VirusTotal, as those are the interesting pieces. A future expansion could be to automatically submit unknown samples from our own data repository to VirusTotal.

Link to the code on GitHub.