How to scan a PDF for malware?

  • Can anyone suggest an automated tool to scan a PDF file to determine whether it might contain malware or other "bad stuff"? Or, alternatively, one that assigns a risk level to the PDF?

    I would prefer a free tool. It must be suitable for programmatic use, e.g., from the Unix command line, so that it is possible to scan PDFs automatically and take action based upon that. A web-based solution might also be OK if it is scriptable.

  • Very easy.

    Didier Stevens has provided two open-source, Python-based scripts to perform PDF malware analysis. There are a few others that I will also highlight.

    The primary ones you want to run first are PDFiD (available along with Didier's other PDF Tools) and Pyew.

    Here is an article on how to run PDFiD and see the expected results; here is another for Pyew.

    Finally, after identifying possible /JS, /JavaScript, /AA, /OpenAction, and /AcroForm entries, you will want to dump those objects, filter the JavaScript, and produce raw output. This is possible with

    Additionally, Brandon Dixon maintains some extremely elite blog posts on his research with PDF malware, including a post about scoring PDFs based on malicious filters just like you describe.

    I, personally, run all of these tools!
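
    To give a feel for the kind of triage PDFiD performs, here is a minimal sketch that counts the suspicious names listed above in the raw bytes of a file. It is not PDFiD itself: the function names are made up for illustration, the keyword list covers only the five names mentioned in this answer, and it only sees names that are not hidden inside compressed streams. It does decode `#xx` escapes, since names can be written that way to dodge naive string matching.

```python
import re

# Only the names mentioned in the answer above; PDFiD's real list is longer.
SUSPICIOUS_NAMES = ("/JS", "/JavaScript", "/AA", "/OpenAction", "/AcroForm")

# A PDF name starts with "/" and runs until whitespace or a delimiter.
_NAME_RE = re.compile(rb"/[^\s/<>\[\]()%{}]+")
# Inside a name, "#xx" stands for the byte with hex value xx.
_HEX_RE = re.compile(rb"#([0-9A-Fa-f]{2})")


def normalize_name(raw: bytes) -> str:
    """Decode #xx escapes so b'/J#61vaScript' becomes '/JavaScript'."""
    decoded = _HEX_RE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return decoded.decode("latin-1")


def count_suspicious_names(data: bytes) -> dict:
    """Count exact occurrences of each suspicious name in raw PDF bytes."""
    counts = {name: 0 for name in SUSPICIOUS_NAMES}
    for match in _NAME_RE.finditer(data):
        name = normalize_name(match.group(0))
        if name in counts:
            counts[name] += 1
    return counts
```

    Calling `count_suspicious_names(open("sample.pdf", "rb").read())` returns a dictionary of counts. A non-zero count is only a hint that the file deserves a closer look, not proof of malice.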

    +1 good reply, Didier Stevens is the main focus when looking at PDF based malware. He also has tools such as '' which enable you to create proof of concept malicious PDFs to understand their structure more fully and why they are an issue.

    Thanks for all the resources! I don't mean to sound ungrateful, but: to be honest, what I was ideally hoping for was a single script that I could run that outputs a binary answer ("this looks like malware" or "this looks safe"). In contrast, the resources you mention look like they're intended for a professional analyst who wants to poke through the contents of a PDF file and form an opinion based upon their human judgement. Do you know of anyone who has packaged these up into a single script that produces a yes/no answer (or a numeric risk score)?

    @ D.W. Yeah that last URL I mentioned does what you want, but it does happen to be a Python script. It's really easy to install Python -- don't let it scare you!
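
    For illustration only, here is a rough sketch of how keyword counts (for example, ones parsed from PDFiD output) could be folded into a numeric score and a yes/no answer suitable for a shell script. The weights, the threshold, and the function names are all hypothetical; they are not taken from the scoring filter linked above and would need tuning against real samples.

```python
# Hypothetical weights, chosen for illustration only.
WEIGHTS = {
    "/JS": 3,
    "/JavaScript": 3,
    "/OpenAction": 2,
    "/AA": 2,
    "/AcroForm": 1,
}
THRESHOLD = 5  # hypothetical cut-off between "suspicious" and "no indicators"


def risk_score(counts: dict) -> int:
    """Add the weight of every name that occurs at least once."""
    return sum(w for name, w in WEIGHTS.items() if counts.get(name, 0) > 0)


def verdict(counts: dict) -> str:
    """Turn the score into a coarse yes/no style label."""
    if risk_score(counts) >= THRESHOLD:
        return "suspicious"
    return "no indicators found"


def exit_code(counts: dict) -> int:
    """Exit status for shell use: 1 means suspicious, 0 means nothing flagged."""
    return 1 if risk_score(counts) >= THRESHOLD else 0
```

    Note that "no indicators found" is deliberately not phrased as "safe": a score like this can only say that nothing known to be risky was spotted.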

    Here is a newer tool with newer techniques from early 2016 --

    @atdre Using ``, how to find out if `ObjStm`s are malicious?

    Can pdftk be used to disassemble and reassemble the file, discarding any bad embedded stuff (at least 99% of it)?

    Maybe, but pdftk itself, a library it links against, or some other system-wide component could be vulnerable to a memory corruption attack kicked off by that same embedded stuff. An easy example of this can be found in CVE-2019-7113 via a specially crafted PDF. Unless you are debugging your own PDF file-loading code while at the same time reversing the PDF in question, you may never know, depending on the evasion techniques employed during and after code execution. A few things might pick up on this activity, such as osquery (or any EDR) with security/incident packs and configs; others, e.g., GRSecurity, defend against it.

  • I just came across this very recent blog post by Lenny Zeltser, which is pretty much right on the money:

    6 Free Tools for analyzing Malicious PDF Files

    The tools he mentions are:

    There are details about each one and links to other pdf analysis documents at the blog post.

  • For the past few months I have been doing research on PDF analysis and how it could be better improved. While doing the research I found myself writing tools and scripts to help me get the job done and decided it was time to put something more useful together. PDF X-RAY is a static analysis tool that allows you to analyze PDF files through a web interface or API. The tool uses multiple open source tools and custom code to take a PDF and turn it into a sharable format. The goal with this tool is to centralize PDF analysis and begin sharing comments on files that are seen.

    PDF X-RAY differs from all other tools because it doesn't focus on the single file. Instead it compares the file you upload against thousands of malicious PDF files in our repository. These checks look for similar data structures within the PDF you upload and ones that have been reviewed by analysts. Using this feature we can begin to see shared code samples among malicious files or trends due to malicious authors' coding styles. The tool is still in beta, but I wanted to release it to the public to see what users thought. In my opinion the API is the most useful part, as you can begin to integrate rich PDF analysis into other tools and services at little or no cost.

    Current features include:

    • Summary report
    • Interactive report (includes all the information I have)
    • Related through characteristics
    • Account access and features
    • Full API (submit, report, full object, etc.)
    • Searching (not all implemented, but all hashing aspects work)
    • Sandbox dump of JS code
    • Flagging of streams (malicious or not malicious) for logged in users (anonymous users can see how many people marked something as malicious)
    • Reports (last 50 run, among others; some not yet released)
    • Social network hooks (causes some slowness, so I may replace this)
    • Basic help documentation
    • Image preview generation

    Sample Report from

    Looks like this domain is down.

  • I am actually in the process of moving my tool (see Scoring PDFs Based on Malicious Filter - 9b+) to a hosted environment where you can upload samples through an API or web portal. In its current state it will scan the PDF, pull in as much data as possible and compare it against hundreds of other malicious files. Please send me an email and I will be sure to let you know personally when it is available for use.

    In the meantime, you can use the filter I have created, which detects a little more than 50% of my malware now. It needs to be tweaked a little, but I would be willing to take a look at your samples personally and give you my best guess. Like I said, email me and we can exchange information.

    It should be pointed out that what you are asking for is very rare and a problem that not many people have solved. The problem is hard to approach given the flexibility in the PDF specification. The reality is that there is no scanner that I currently know of, including mine, that will dive recursively into nested objects where malicious content could live. With that said, your best bet is to get up to speed just a little bit on what to look for in a PDF. Didier has good resources for that as well as good posts. Read my blog and you will gain an idea of how I am approaching this problem.
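
    As a small illustration of why nesting matters, the sketch below inflates any stream that zlib can decompress and re-scans the result, repeating to a fixed depth (which helps with, say, a PDF embedded inside another PDF). It is a simplification built on assumptions: it locates streams with a regular expression instead of parsing objects, handles only plain Flate data (no filter chains, predictors, or encryption), and its keyword list and function names are invented for this example. It is not how my filter or any of the tools mentioned here work.

```python
import re
import zlib

# Everything between "stream<EOL>" and the next "endstream".
_STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
KEYWORDS = (b"/JS", b"/JavaScript", b"/AA", b"/OpenAction", b"/AcroForm")


def inflate_streams(data: bytes):
    """Yield the inflated content of each stream that zlib can decode."""
    for match in _STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates the trailing EOL left before "endstream".
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate-encoded, or uses a filter we do not handle


def scan(data: bytes, depth: int = 0, max_depth: int = 5) -> list:
    """Return (depth, keyword) pairs found in data and in inflated streams."""
    hits = [(depth, kw.decode()) for kw in KEYWORDS if kw in data]
    if depth < max_depth:
        for inner in inflate_streams(data):
            hits.extend(scan(inner, depth + 1, max_depth))
    return hits
```

    A hit at depth 1 or deeper means the keyword was only visible after decompression, which is exactly the kind of content a scan of the raw bytes would miss.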

  • Have you tried just utilising VirusTotal as an indicator of potential malicious content? I know this is my first stop for most file verifications. You could script up a curl request to their MD5 search engine perhaps?

    I thought VirusTotal used SHA256?

    I believe it does both? You should probably use SHA256 for future-proofing. Cheers Dre!
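
    A hash lookup along these lines might look like the sketch below. It assumes the VirusTotal API v3 file-report endpoint (`/api/v3/files/{hash}` with an `x-apikey` header) and the `last_analysis_stats` field in the response; both are written from memory and should be checked against the current API documentation. You need your own API key, and the function names are made up for this example. Only the hash leaves your machine, not the file itself.

```python
import hashlib
import json
import urllib.request

# Assumed endpoint for the VirusTotal v3 file report; verify before relying on it.
VT_FILE_URL = "https://www.virustotal.com/api/v3/files/{}"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash the file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_request(file_hash: str, api_key: str) -> urllib.request.Request:
    """Prepare the lookup request without sending it."""
    return urllib.request.Request(
        VT_FILE_URL.format(file_hash), headers={"x-apikey": api_key}
    )


def malicious_count(path: str, api_key: str) -> int:
    """Return how many engines flagged the file.

    urlopen raises urllib.error.HTTPError if the request fails, for example
    when the hash is unknown to the service.
    """
    request = build_request(sha256_of(path), api_key)
    with urllib.request.urlopen(request) as response:
        report = json.load(response)
    return report["data"]["attributes"]["last_analysis_stats"]["malicious"]
```

    An unknown hash is not evidence that a file is clean; it only means nobody has submitted that exact file before.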

Licensed under CC-BY-SA with attribution

Content dated before 7/24/2021 11:53 AM