Scanning PDF: what for?

Everyone should know by now that PDF files are as dangerous as any other kind of file, perhaps even more so because of their interactive features.

So, the question when scanning a PDF file is: what to look for?

Of course, you can try to detect malicious content. That works most of the time, as long as the content is not concealed. For instance, a scanner can detect shellcode or a virus, as happens with most of the public exploits for the JBIG2 flaw (CVE-2009-0658).

However, a PDF file can be suspicious simply because it contains some unusual features. Knowing that these features are present helps when making a decision...

Thus, we wrote a scanner for PDF files. In fact, we wrote two of them, combined in a single script (see scripts/scan/pdfscan.rb in origami):

  • fast mode (the default): this scan is based on regular expressions
  • deep mode (-t deep): much slower, it is based on the real PDF semantics of the objects, which means we need to interpret all of them in the file.

In both cases, the first step is to get a sanitized file, i.e. one where the names of primitives are written in a canonical form, without mixing hexadecimal escapes and plain ASCII (yes, OpenAction and Open#41ction denote the same name).
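
To make the sanitizing step concrete, here is a minimal Ruby sketch that decodes #xx escapes inside name tokens (#41 is the hexadecimal code of "A"). The function name normalize_pdf_names is hypothetical and this is not the code used by pdfscan.rb; it only illustrates the idea on a raw string.

```ruby
# Sketch only: normalize_pdf_names is a hypothetical helper, not part of origami.
# It decodes #xx hexadecimal escapes inside PDF name tokens so that
# equivalent spellings of the same name compare equal.
def normalize_pdf_names(data)
  # A name starts with "/" and runs until whitespace or a PDF delimiter.
  data.b.gsub(/\/[^\s\/<>\[\](){}%]+/) do |name|
    name.gsub(/#(\h\h)/) { Regexp.last_match(1).hex.chr }
  end
end

puts normalize_pdf_names("<< /Open#41ction 5 0 R /S /J#61vaScript >>")
```

Once names are normalized, a single pattern per name is enough for the rest of the scan.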

Scanning quickly

Let us have a look at the output of our scan:

~/malicious-pdf/src/scripts/scan$ ./pdfscan.rb ../exploits/adobe_geticon.pdf 
Reading file...
Fast scanning...
[File ID]
File: ../exploits/adobe_geticon.pdf
FileSize: 2366
[Structure]
Header: %PDF-1.2
Revisions: 1
Catalog: 1
object: 5
endobj: 5
stream: 1
endstream: 1
/ObjStm: 0
xref: 1
trailer: 1
startxref: 1
Root (current): 1 0 R
Size (current): 6
[Properties]
/Encrypt: 0
EmbeddedFile: 0
[Triggers]
/OpenAction: 0
/AA: 1
/Names: 0
[Actions]
/GoTo: 0
/GoToR: 0
/GoToE: 0
/Launch: 0
/Thread: 0
/URI: 0
/Sound: 0
/Movie: 0
/Hide: 0
/Named: 0
/SetOCGState: 0
/Rendition: 0
/Transition: 0
/Go-To-3D: 0
/JavaScript: 1
[FormActions]
/SubmitForm: 0
/ResetForm: 0
/ImportData: 0

The first thing to notice is that the analysis is split into several sections:

  • [File ID]: a reminder of which file is being analyzed
  • [Structure]: a quick view of the structure of the PDF (PDF version, objects, streams, object streams if present, ...). Each of these elements can give a clue about the status of the file.
  • [Properties]: whether the file is encrypted (which can be used to hide some data, like a shellcode) or contains embedded files (which malware can use as a dropper).
  • [Triggers]: malicious content is useless as long as it is not executed, so we look for known ways to trigger events.
  • [Actions] and [FormActions]: even if an action is not mandatory to exploit a flaw, actions are suspicious by nature, as most PDF files should not contain dynamic actions.
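
As an illustration of how a regular-expression scan can fill such sections, here is a rough Ruby sketch. The SECTIONS table, the NAME_CHAR pattern and the fast_scan function are assumptions made for this example; the expressions really used by pdfscan.rb may differ.

```ruby
# Hypothetical pattern table: a small subset of the names listed in the report.
SECTIONS = {
  "Triggers" => %w[/OpenAction /AA /Names],
  "Actions"  => %w[/Launch /URI /JavaScript]
}.freeze

# Characters that may continue a PDF name (anything but whitespace and delimiters).
NAME_CHAR = '[^\s/<>\[\](){}%]'

# Count each name in the raw data; the lookahead keeps /AA from matching /AAPL.
def fast_scan(data)
  SECTIONS.map do |section, names|
    counts = names.map do |name|
      [name, data.scan(/#{Regexp.escape(name)}(?!#{NAME_CHAR})/).size]
    end
    [section, counts.to_h]
  end.to_h
end

sample = "1 0 obj\n<< /Type /Catalog /AA << /O << /S /JavaScript /JS (app.alert(1)) >> >> >>\nendobj\n"
fast_scan(sample).each do |section, counts|
  puts "[#{section}]"
  counts.each { |name, count| puts "#{name}: #{count}" }
end
```

A non-zero count in [Triggers] or [Actions] is only a hint that the file deserves a closer look, not a verdict.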

Let us go back to our sample now. When a pattern is searched for and the result is surprising, it is highlighted (in red). It simply means there is something to look at in the PDF. You can then run the walker (our nice GTK GUI designed to explore the content of a PDF).

Another way is to run a deep scan, which is much slower but fixes some mistakes made by the fast scan.

Scanning deeply

The deep scan is based on our parser. It means we go deep into the objects. For instance, streams are decompressed (including object streams), and the searches are then made on the result of the parsing. If the parser fails, it means the PDF was not correct. As a (sad) consequence of all this work, the deep scan is really slow, especially for files with many objects.
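
To show why decompressing streams matters, here is a small Ruby sketch using the standard Zlib library. It is not the origami parser: stream_bodies and inflate_or_raw are hypothetical helpers, stream boundaries are found with a naive regular expression, and only zlib (FlateDecode) data is handled.

```ruby
require "zlib"

# Inflate a stream body when it is valid zlib data, otherwise keep the raw bytes.
def inflate_or_raw(body)
  Zlib::Inflate.inflate(body)
rescue Zlib::Error
  body
end

# Naive extraction of everything between "stream" and "endstream".
def stream_bodies(data)
  data.b.scan(/stream\r?\n(.*?)\nendstream/m).flatten.map { |body| inflate_or_raw(body) }
end

# Build a toy object whose stream hides a JavaScript action behind compression.
hidden = Zlib::Deflate.deflate("<< /S /JavaScript /JS (app.alert(1)) >>")
pdf = "5 0 obj\n<< /Filter /FlateDecode >>\nstream\n".b + hidden + "\nendstream\nendobj\n".b

puts "raw search:      #{pdf.include?('/JavaScript')}"
puts "inflated search: #{stream_bodies(pdf).any? { |s| s.include?('/JavaScript') }}"
```

A search on the raw bytes is expected to miss the name, since it only exists in compressed form; a search on the inflated bodies finds it.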

Let's have a look.

Fast scan, then deep scan, of the same file:
./pdfscan.rb   /tmp/0416.pdf  
Fast scanning...
[File ID]
File: /tmp/0416.pdf
FileSize: 989281
[Structure]
Header: %PDF-1.3
Revisions: 2
Catalog: 1
object: 191
endobj: 190
stream: 102
endstream: 102
/ObjStm: 0
xref: 2
trailer: 2
startxref: 2
Root (current):
Size (current): 183
[Properties]
/Encrypt: 0
EmbeddedFile: 0
[Triggers]
/OpenAction: 1
/AA: 0
/Names: 0
[Actions]
/GoTo: 1
/GoToR: 0
/GoToE: 0
/Launch: 0
/Thread: 0
/URI: 0
/Sound: 0
/Movie: 0
/Hide: 0
/Named: 0
/SetOCGState: 0
/Rendition: 0
/Transition: 0
/Go-To-3D: 0
/JavaScript: 0
[FormActions]
/SubmitForm: 0
/ResetForm: 0
/ImportData: 0
./pdfscan.rb  -t deep /tmp/0416.pdf  
Deep scanning...
[File ID]
File: /tmp/0416.pdf
FileSize: 989281
[Structure]
Header: %PDF-1.3
Revisions: 2
Catalog: 1
Indirect objects: 191
Total objects: 1589
Streams: 102
Object streams: 0
Root (current):
Size (current): 183
[Properties]
Encrypted: false
EmbeddedFile: 0
[Triggers]
/OpenAction: 1
/AA: 0
/Names: 0
[Actions]
/GoTo: 1
/GoToR: 0
/GoToE: 0
/Launch: 0
/Thread: 0
/URI: 0
/Sound: 0
/Movie: 0
/Hide: 0
/Named: 0
/SetOCGState: 0
/Rendition: 0
/Transition: 0
/Go-To-3D: 0
/JavaScript: 0
[FormActions]
/SubmitForm: 0
/ResetForm: 0
/ImportData: 0

The information displayed in the [Structure] section is not the same:

  • The closing tags (endobj and endstream) are not counted because, if they are missing, the parser will complain.
  • The obj ... endobj tags delimit indirect objects, but objects can also be embedded inside other objects. The fast scan is unable to spot those, whereas the deep scan displays the total number of objects.
  • The Root entry is empty because the PDF we use is linearized and includes several updates. We still need to add this to our parser to support it properly.

Regarding the results now, the deep scan raises no more warnings about the indirect objects. The discrepancy in the fast scan (191 object for 190 endobj) is due to a mistake in the regular expression used to find the closing endobj. Since the parser used during the deep scan does not complain, nothing weird is hidden there.
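
For illustration, the kind of consistency check behind such a warning can be sketched in a few lines of Ruby. The object_balance function and its patterns are assumptions made for this example; they are not the expressions used in pdfscan.rb, and they do not reproduce the mistake mentioned above.

```ruby
# Hypothetical check: compare the number of "N G obj" openers with the
# number of "endobj" closers found in the raw data.
def object_balance(data)
  opened = data.scan(/\b\d+\s+\d+\s+obj\b/).size
  closed = data.scan(/\bendobj\b/).size
  { object: opened, endobj: closed, balanced: opened == closed }
end

# Toy input where the second object is never closed.
broken = "1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\n"
p object_balance(broken)
```

A mismatch only says that the counts disagree: it can come from the file or, as here, from the patterns themselves, which is why the parser has the last word.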

Scanning encrypted files

We have already discussed that it is possible to encrypt a PDF. However, encryption of a PDF file mostly covers the content of streams and strings, not PDF primitives. Hence, even if a file is encrypted, we can still spot suspicious actions most of the time (yes, there is at least one way to hide them too, but we'll talk about that another day ;)

Fast scan, then deep scan, of the same file:
./pdfscan.rb   ../crypto/encrypted_calc.pdf
Fast scanning...
[File ID]
File: ../crypto/encrypted_calc.pdf
FileSize: 2568
[Structure]
Header: %PDF-1.5
Revisions: 1
Catalog: 1
object: 6
endobj: 6
stream: 1
endstream: 1
/ObjStm: 0
xref: 1
trailer: 1
startxref: 1
Root (current): 1 0 R
Size (current): 7
[Properties]
/Encrypt: 1
EmbeddedFile: 0
[Triggers]
/OpenAction: 1
/AA: 0
/Names: 0
[Actions]
/GoTo: 0
/GoToR: 0
/GoToE: 0
/Launch: 3
/Thread: 0
/URI: 0
/Sound: 0
/Movie: 0
/Hide: 0
/Named: 0
/SetOCGState: 0
/Rendition: 0
/Transition: 0
/Go-To-3D: 0
/JavaScript: 0
[FormActions]
/SubmitForm: 0
/ResetForm: 0
/ImportData: 0
./pdfscan.rb  -t deep ../crypto/encrypted_calc.pdf
Deep scanning...
[File ID]
File: ../crypto/encrypted_calc.pdf
FileSize: 2568
[Structure]
Header: %PDF-1.5
Revisions: 1
Catalog: 1
Indirect objects: 6
Total objects: 85
Streams: 1
Object streams: 0
Root (current): 1 0 R
Size (current): 7
[Properties]
Encrypted: true
EmbeddedFile: 0
[Triggers]
/OpenAction: 1
/AA: 0
/Names: 0
[Actions]
/GoTo: 0
/GoToR: 0
/GoToE: 0
/Launch: 1
/Thread: 0
/URI: 0
/Sound: 0
/Movie: 0
/Hide: 0
/Named: 0
/SetOCGState: 0
/Rendition: 0
/Transition: 0
/Go-To-3D: 0
/JavaScript: 0
[FormActions]
/SubmitForm: 0
/ResetForm: 0
/ImportData: 0

This PDF is supposed to launch the calculator when it is opened. Both scanners see the /Launch action. However, if you look in the PDF, you cannot tell what command is going to be launched, because the strings are encrypted:

        /OpenAction <<
/F <<
/Mac (~\y^G<~D^G\n=2?Xp~[9Woc~BR~YA~_~R^^^?~@3^K)
/Unix (^TO`^]^R~E~Yfw\(0~WP!\)~G9~L)
/DOS (q5B]{^H\n\n~~G^X^SX^V| Q~\&E{i^P@{ Y~X^@^^^S)
>>
/S /Launch
>>

Last point: the fast scan reports 3 /Launch whereas the deep one reports only 1. Who's right?

The deep scan, of course! The deep scan is based on the objects: if it has detected a single /Launch action, there is only one. So the 2 others detected by the fast scan must come from somewhere else. As a matter of fact, the string "/Launch" also appears in the text written on the page. So the fast scan can overreact sometimes but, at least, it will not produce false negatives.
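
This kind of false positive is easy to reproduce. The Ruby sketch below counts /Launch in a toy file, first on the raw text and then after blanking the stream bodies. The helpers count_name and strip_stream_bodies and the sample file are made up for the example; blanking streams is only a cheap approximation of what a real parser does.

```ruby
# Count occurrences of a PDF name, requiring a delimiter or whitespace after it.
def count_name(data, name)
  data.scan(/#{Regexp.escape(name)}(?![^\s\/<>\[\](){}%])/).size
end

# Remove everything between "stream" and "endstream" (naive, unparsed).
def strip_stream_bodies(data)
  data.gsub(/(stream\r?\n).*?(\r?\nendstream)/m, '\1\2')
end

# Toy file: the page text mentions /Launch twice, the catalog uses it once.
toy = <<~'PDF'
  4 0 obj
  << /Length 45 >>
  stream
  BT (This file uses /Launch and /Launch) Tj ET
  endstream
  endobj
  1 0 obj
  << /OpenAction << /S /Launch /F (calc.exe) >> >>
  endobj
PDF

puts "raw count:      #{count_name(toy, '/Launch')}"
puts "outside stream: #{count_name(strip_stream_bodies(toy), '/Launch')}"
```

On this toy input the raw count is 3 and the count outside the stream is 1, the same gap as between the two reports above.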

Last words

Scanning a file for something suspicious can be long and tricky. Every scan is a compromise between speed and precision, so we offer both ways to perform a scan. What is sure is that our fast scanner makes mistakes, whereas the deep one is only as good as the parser we have written for the PDF language ... and is damn slow, especially on big files!

Last but not least: remember, this script is not supposed to decide for you whether a PDF is dangerous, but it is simply here to help you to decide.