When you need to quickly analyze a lot of data, one step is critical: triage. In forensic investigations, it allows investigators to quickly identify, prioritize, and isolate the most relevant or high-value evidence from large volumes of data, so that limited time and resources are focused on the artifacts most likely to reveal key facts about an incident. Sometimes, a quick script is enough to speed up this task.
Today, I’m working on a case where I have a directory containing more than 20,000 mixed files. Among them are a lot of ZIP archives (mainly Office documents), which themselves contain a lot of files. The idea is to scan all those files (including the contents of the ZIP archives) for some keywords. I wrote a quick Python script that scans every file against an embedded YARA rule and, if a match is found, copies the original file into a destination directory.
Here is the script:
#
# Quick Python triage script
# Copy files matching a YARA rule to another directory
#
import yara
import os
import shutil
import zipfile

# YARA rule
yara_rule = """
rule case_xxxxxx_search_1
{
    strings:
        $s1 = "string1" nocase wide ascii
        $s2 = "string2" nocase wide ascii
        $s3 = "string3" nocase wide ascii
        $s4 = "string4" nocase wide ascii
        $s5 = "string5" nocase wide ascii
    condition:
        any of ($s*)
}
"""

source_dir = "Triage"
dest_dir = "MatchedFiles"
os.makedirs(dest_dir, exist_ok=True)

rules = yara.compile(source=yara_rule)

def is_zip_file(filepath):
    """
    Check ZIP archive magic bytes.
    """
    try:
        with open(filepath, "rb") as f:
            sig = f.read(4)
        return sig in (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")
    except Exception:
        return False

def safe_extract_path(member_name):
    """
    Returns a safe relative path inside the destination folder (Prevent .. in paths).
    Not called below: the script copies whole archives and never extracts members to disk.
    """
    return os.path.normpath(member_name).replace("..", "_")

def scan_file(filepath, file_bytes=None, inside_zip=False, zip_name=None, member_name=None):
    """
    Scan a file with YARA.
    """
    try:
        if file_bytes is not None:
            # ZIP member: scan the bytes already read into memory
            matches = rules.match(data=file_bytes)
        else:
            matches = rules.match(filepath)
        if matches:
            if inside_zip:
                print(f"[MATCH] {member_name} (inside {zip_name})")
                # Copy the whole archive, not only the matching member
                rel_path = os.path.relpath(zip_name, source_dir)
                filepath = os.path.join(source_dir, rel_path)
                dest_path = os.path.join(dest_dir, rel_path)
            else:
                print(f"[MATCH] {filepath}")
                rel_path = os.path.relpath(filepath, source_dir)
                dest_path = os.path.join(dest_dir, rel_path)
            # Save a copy
            os.makedirs(os.path.dirname(dest_path), exist_ok=True)
            shutil.copy2(filepath, dest_path)
    except Exception as e:
        print(e)

# Main
for root, dirs, files in os.walk(source_dir):
    for name in files:
        filepath = os.path.join(root, name)
        if is_zip_file(filepath):
            try:
                with zipfile.ZipFile(filepath, 'r') as z:
                    for member in z.namelist():
                        if member.endswith("/"):
                            # Skip directories
                            continue
                        try:
                            file_data = z.read(member)
                            scan_file(member, file_bytes=file_data, inside_zip=True, zip_name=filepath, member_name=member)
                        except Exception:
                            pass
            except zipfile.BadZipFile:
                pass
        else:
            scan_file(filepath)
Now, you can enjoy some coffee while the script does the job:
[MATCH] docProps/app.xml (inside Triage\xxxxxxx.xlsx)
[MATCH] xl/sharedStrings.xml (inside Triage\xxxxx.xlsx)
[MATCH] xl/sharedStrings.xml (inside Triage\xxxxxxxxxxxxxxxxxxxx.xlsx)
[MATCH] ppt/slides/slide3.xml (inside Triage\xxxxxxxxxxxxxxxxxxxxxx.pptx)
[MATCH] ppt/slides/slide12.xml (inside Triage\xxxxxxxxxxxxxxxxxxxxxx.pptx)
[MATCH] ppt/slides/slide14.xml (inside Triage\xxxxxxxxxxxxxxxxxxxxxx.pptx)
[MATCH] ppt/slides/slide15.xml (inside Triage\xxxxxxxxxxxxxxxxxxxxxx.pptx)
[MATCH] xl/sharedStrings.xml (inside Triage\xxxxxxxx.xlsx)
[MATCH] Triage\xxxxxxxxxxxxxxxxxxxxxxx.pdf
[MATCH] Triage\xxxxxxxxxxxxxxxxxxx.xls
[MATCH] xl/sharedStrings.xml (inside Triage\xxxxxxxxxxxxxxxx.xlsx)
[MATCH] Triage\xxxxxxxxxxxxxxxxxxxxxxxxxx.xls
You can see that, with a few lines of Python, you can speed up the triage phase of your investigations. Note that the script is written to handle my current file set and is not ready for broader use (for example, it does not handle password-protected archives or other archive types).
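As an illustration of the password-protected case, here is a minimal sketch (not part of my script) that separates encrypted ZIP members from readable ones by checking bit 0 of each entry's general-purpose flags, exposed by Python's zipfile module as ZipInfo.flag_bits. The names ENCRYPTED_FLAG, is_encrypted and split_members are hypothetical helpers introduced only for this example.

```python
import zipfile

# Bit 0 of the ZIP general-purpose flags marks an encrypted entry
ENCRYPTED_FLAG = 0x1


def is_encrypted(info: zipfile.ZipInfo) -> bool:
    """Return True if the ZIP entry is flagged as encrypted."""
    return bool(info.flag_bits & ENCRYPTED_FLAG)


def split_members(zip_path):
    """
    Hypothetical helper: return (readable, encrypted) member names so that
    encrypted entries can be reported instead of being silently skipped.
    Accepts a path or a file-like object, like zipfile.ZipFile itself.
    """
    readable, encrypted = [], []
    with zipfile.ZipFile(zip_path, "r") as z:
        for info in z.infolist():
            if info.is_dir():
                # Skip directories
                continue
            if is_encrypted(info):
                encrypted.append(info.filename)
            else:
                readable.append(info.filename)
    return readable, encrypted
```

In the main loop, reading an encrypted member without a password raises an exception that the script swallows, so such files simply produce no output. Printing the names returned in the second list (for example with an [ENCRYPTED] prefix) would be one way to flag those archives for manual review.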
Xavier Mertens (@xme)
Xameco
Senior ISC Handler – Freelance Cyber Security Consultant
PGP Key