
Python Digital Network Forensics-II



The previous chapter dealt with some of the concepts of network forensics using Python. In this chapter, let us understand network forensics using Python at a deeper level.

Web Page Preservation with Beautiful Soup

The World Wide Web (WWW) is a unique resource of information. However, its legacy is at high risk due to the loss of content at an alarming rate. A number of cultural heritage and academic institutions, non-profit organizations and private businesses have explored the issues involved and contributed to the development of technical solutions for web archiving.

Web page preservation, or web archiving, is the process of gathering data from the World Wide Web, ensuring that the data is preserved in an archive and making it available for future researchers, historians and the public. Before proceeding further into web page preservation, let us discuss some important issues related to it, as given below −

    Change in Web Resources − Web resources keep changing every day, which is a challenge for web page preservation.

    Large Quantity of Resources − Another issue related to web page preservation is the sheer quantity of resources that must be preserved.

    Integrity − Web pages must be protected from unauthorized amendment, deletion or removal to protect their integrity.

    Dealing with multimedia data − While preserving web pages we also need to deal with multimedia data, which might cause issues while doing so.

    Providing access − Besides preserving, the issue of providing access to web resources and dealing with issues of ownership needs to be solved too.

In this chapter, we are going to use a Python library named Beautiful Soup for web page preservation.

What is Beautiful Soup?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It can be used with urllib because it needs an input (a document or URL) to create a soup object, as it cannot fetch a web page by itself. You can learn about this in detail at www.crummy.com/software/BeautifulSoup/bs4/doc/

Note that before using it, we must install the third-party library using the following command −

pip install bs4

Alternatively, using the Anaconda package manager, we can install Beautiful Soup as follows −

conda install -c anaconda beautifulsoup4
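
For instance, once the library is installed, the following minimal sketch (the URL http://example.com is only a placeholder) shows how a page fetched with urllib is turned into a soup object and its hyperlinks are listed −

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Beautiful Soup cannot fetch a page itself, so fetch the HTML with urllib first
html = urlopen("http://example.com").read().decode("utf-8")

# Parse the document and print every hyperlink found on the page
soup = BeautifulSoup(html, "html.parser")
for anchor in soup.find_all("a", href=True):
   print(anchor["href"])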

Python Script for Preserving Web Pages

The Python script for preserving web pages by using the third-party library called Beautiful Soup is discussed here −

First, import the required libraries as follows −

from __future__ import print_function
import argparse

from bs4 import BeautifulSoup, SoupStrainer
from datetime import datetime

import hashlib
import logging
import os
import ssl
import sys
from urllib.request import urlopen

import urllib.error
logger = logging.getLogger(__name__)

Note that this script will take two positional arguments, one being the URL which is to be preserved and the other the desired output directory, as shown below −

if __name__ == "__main__":
   parser = argparse.ArgumentParser("Web Page preservation")
   parser.add_argument("DOMAIN", help="Website Domain")
   parser.add_argument("OUTPUT_DIR", help="Preservation Output Directory")
   parser.add_argument("-l", help="Log file path",
   default=__file__[:-3] + ".log")
   args = parser.parse_args()

Now, set up logging for the script by specifying a stream handler and a file handler to document the acquisition process, as shown below −

logger.setLevel(logging.DEBUG)
msg_fmt = logging.Formatter("%(asctime)-15s %(funcName)-10s %(levelname)-8s %(message)s")
strhndl = logging.StreamHandler(sys.stderr)
strhndl.setFormatter(fmt=msg_fmt)
fhndl = logging.FileHandler(args.l, mode="a")
fhndl.setFormatter(fmt=msg_fmt)

logger.addHandler(strhndl)
logger.addHandler(fhndl)
logger.info("Starting BS Preservation")
logger.debug("Suppped arguments: {}".format(sys.argv[1:]))
logger.debug("System " + sys.platform)
logger.debug("Version " + sys.version)

Now, let us do the input validation on the desired output directory as follows −

if not os.path.exists(args.OUTPUT_DIR):
   os.makedirs(args.OUTPUT_DIR)
main(args.DOMAIN, args.OUTPUT_DIR)

Now, we will define the main() function, which will extract the base name of the website by removing the unnecessary elements before the actual name, along with additional validation on the input URL, as follows −

def main(website, output_dir):
   base_name = website.replace("https://", "").replace("http://", "").replace("www.", "")
   link_queue = set()
   
   if "http://" not in website and "https://" not in website:
      logger.error("Exiting preservation - invapd user input: {}".format(website))
      sys.exit(1)
   logger.info("Accessing {} webpage".format(website))
   context = ssl._create_unverified_context()

Now, we need to open a connection with the URL by using the urlopen() method. Let us use a try-except block as follows −

try:
   index = urlopen(website, context=context).read().decode("utf-8")
except urllib.error.HTTPError as e:
   logger.error("Exiting preservation - unable to access page: {}".format(website))
   sys.exit(2)
logger.debug("Successfully accessed {}".format(website))

The next lines of code include three functions, as explained below −

    write_output() function to write the first web page to the output directory

    find_links() function to identify the links on this web page

    recurse_pages() function to iterate through and discover all links on the web page.

write_output(website, index, output_dir)
link_queue = find_links(base_name, index, link_queue)
logger.info("Found {} initial links on webpage".format(len(link_queue)))
recurse_pages(website, link_queue, context, output_dir)
logger.info("Completed preservation of {}".format(website))

Now, let us define the write_output() method as follows −

def write_output(name, data, output_dir, counter=0):
   name = name.replace("http://", "").replace("https://", "").rstrip("//")
   directory = os.path.join(output_dir, os.path.dirname(name))
   
   if not os.path.exists(directory) and os.path.dirname(name) != "":
      os.makedirs(directory)

We need to log some details about the web page and then log the hash of the data by using the hash_data() method as follows −

logger.debug("Writing {} to {}".format(name, output_dir)) logger.debug("Data Hash: {}".format(hash_data(data)))
path = os.path.join(output_dir, name)
path = path + "_" + str(counter)
with open(path, "w") as outfile:
   outfile.write(data)
logger.debug("Output File Hash: {}".format(hash_file(path)))

Now, define the hash_data() method, with the help of which we take the UTF-8 encoded data and generate its SHA-256 hash, as follows −

def hash_data(data):
   sha256 = hashlib.sha256()
   sha256.update(data.encode("utf-8"))
   return sha256.hexdigest()

def hash_file(file):
   sha256 = hashlib.sha256()
   with open(file, "rb") as in_file:
      sha256.update(in_file.read())
   return sha256.hexdigest()

Now, let us create a BeautifulSoup object out of the web page data inside the find_links() method as follows −

def find_links(website, page, queue):
   for link in BeautifulSoup(page, "html.parser", parse_only=SoupStrainer("a", href=True)):
      if website in link.get("href"):
         if not os.path.basename(link.get("href")).startswith("#"):
            queue.add(link.get("href"))
   return queue

Now, we need to define the recurse_pages() method by providing it the inputs of the website URL, the current link queue, the unverified SSL context and the output directory as follows −

def recurse_pages(website, queue, context, output_dir):
   processed = []
   counter = 0
   
   while True:
      counter += 1
      if len(processed) == len(queue):
         break
      for link in queue.copy():
         if link in processed:
            continue
         processed.append(link)
         try:
            page = urlopen(link, context=context).read().decode("utf-8")
         except urllib.error.HTTPError as e:
            msg = "Error accessing webpage: {}".format(link)
            logger.error(msg)
            continue

Now, write the output of each web page accessed in a file by passing the link name, page data, output directory and the counter as follows −

write_output(link, page, output_dir, counter)
queue = find_links(website, page, queue)
logger.info("Identified {} links throughout website".format(
   len(queue)))

Now, when we run this script by providing the URL of the website, the output directory and a path to the log file, we will get the details of that web page preserved for future reference.
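
For example, assuming the script has been saved as bs_preservation.py (an illustrative file name), it could be invoked as shown below −

python bs_preservation.py https://example.com /cases/web_preservation -l preservation.log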

Virus Hunting

Have you ever wondered how forensic analysts, security researchers, and incident responders can tell the difference between useful software and malware? The answer lies in the question itself, because without studying the malware being rapidly generated by hackers, it is quite impossible for researchers and specialists to tell the difference between useful software and malware. In this section, let us discuss VirusShare, a tool that helps accomplish this task.

Understanding VirusShare

VirusShare is the largest privately owned collection of malware samples, providing security researchers, incident responders, and forensic analysts with samples of live malicious code. It contains over 30 million samples.

The benefit of VirusShare is the list of malware hashes that is freely available. Anybody can use these hashes to create a very comprehensive hash set and use it to identify potentially malicious files, as illustrated in the sketch below. But before using VirusShare, we suggest you visit https://virusshare.com for more details.
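
As an illustration of how such a hash set might be used, the following minimal sketch (the file names virusshare_hashset.txt and suspect_file.bin are purely hypothetical) computes the MD5 hash of a file under examination and checks it against a newline-delimited hash list −

import hashlib

# Load the newline-delimited hash set into memory (hypothetical path)
with open("virusshare_hashset.txt") as hashfile:
   known_bad = set(line.strip().lower() for line in hashfile
                   if line.strip() and not line.startswith("#"))

# Hash the file under examination and compare it against the hash set
md5 = hashlib.md5()
with open("suspect_file.bin", "rb") as evidence:
   md5.update(evidence.read())

if md5.hexdigest().lower() in known_bad:
   print("[!] File matches a known malware hash")
else:
   print("[+] No match found in the hash set")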

Creating a Newline-Delimited Hash List from VirusShare using Python

A hash list from VirusShare can be used by various forensic tools such as X-Ways and EnCase. In the script discussed below, we are going to automate downloading lists of hashes from VirusShare to create a newline-delimited hash list.

For this script, we need the third-party Python library tqdm, which can be installed as follows −

pip install tqdm

Note that in this script, we will first read the VirusShare hashes page and dynamically identify the most recent hash list. Then we will initialize the progress bar and download the hash lists in the desired range.

First, import the following libraries −

from __future__ import print_function

import argparse
import os
import ssl
import sys
import tqdm

from urllib.request import urlopen
import urllib.error

This script will take one positional argument, which would be the desired path for the hash set −

if __name__ == "__main__":
   parser = argparse.ArgumentParser("Hash set from VirusShare")
   parser.add_argument("OUTPUT_HASH", help="Output Hashset")
   parser.add_argument("--start", type=int, help="Optional starting location")
   args = parser.parse_args()

Now, we will perform the standard input validation as follows −

directory = os.path.dirname(args.OUTPUT_HASH)
if not os.path.exists(directory):
   os.makedirs(directory)
if args.start:
   main(args.OUTPUT_HASH, start=args.start)
else:
   main(args.OUTPUT_HASH)

Now, we need to define the main() function with **kwargs as an argument, because this will create a dictionary we can refer to in order to support supplied keyword arguments, as shown below −

def main(hashset, **kwargs):
   url = "https://virusshare.com/hashes.4n6"
   print("[+] Identifying hash set range from {}".format(url))
   context = ssl._create_unverified_context()

Now, we need to open the VirusShare hashes page by using the urllib.request.urlopen() method. We will use a try-except block as follows −

try:
   index = urlopen(url, context = context).read().decode("utf-8")
except urllib.error.HTTPError as e:
   print("[-] Error accessing webpage - exiting..")
   sys.exit(1)

Now, identify the latest hash list from the downloaded page. You can do this by finding the last instance of an HTML href tag pointing to a VirusShare hash list. It can be done with the following lines of code −

tag = index.rfind(r'a href="hashes/VirusShare_')
stop = int(index[tag + 27: tag + 27 + 5].lstrip("0"))

if "start" not in kwa<rgs:
   start = 0
else:
   start = kwargs["start"]

if start < 0 or start > stop:
   print("[-] Supplied start argument must be greater than or equal "
      "to zero but less than the latest hash list, "
      "currently: {}".format(stop))
   sys.exit(2)
print("[+] Creating a hashset from hash lists {} to {}".format(start, stop))
hashes_downloaded = 0

Now, we will use the tqdm.trange() method to create a loop and progress bar as follows −

for x in tqdm.trange(start, stop + 1, unit_scale=True, desc="Progress"):
   url_hash = "https://virusshare.com/hashes/VirusShare_" "{}.md5".format(str(x).zfill(5))
   try:
      hashes = urlopen(url_hash, context=context).read().decode("utf-8")
      hashes_list = hashes.split("\n")
   except urllib.error.HTTPError as e:
      print("[-] Error accessing webpage for hash list {}"
         " - continuing..".format(x))
      continue

After performing the above steps successfully, we will open the hash set text file in a+ mode to append to the bottom of the text file −

with open(hashset, "a+") as hashfile:
   for line in hashes_list:
      if not line.startswith("#") and line != "":
         hashes_downloaded += 1
         hashfile.write(line + "\n")
print("[+] Finished downloading {} hashes into {}".format(
   hashes_downloaded, hashset))

After running the above script, you will get the latest hash list containing MD5 hash values in text format.
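
For example, assuming the script has been saved as virusshare_hash_set.py (an illustrative file name), it could be run as shown below to build the hash set starting from hash list 0 −

python virusshare_hash_set.py /cases/hashsets/virusshare_hashset.txt --start 0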
