Streams and filters in PDF with origami

Fri 19 June 2009 by fred

As we explained in the previous article, streams are a really important kind of object in PDF. Any data is represented as a stream. However, keeping raw data in a file can be inefficient (think about encoding or size issues for instance). So, this article shows how to create / manipulate streams and filters with origami. In the end, you should understand why it is easy to conceal JavaScript in a PDF file.

What are filters?

As we previously explained, a stream is a dictionary and external data:

42 0 obj <<
    ....
>>
stream
  data is here ...
endstream
endobj

A stream can be used to embed another file, a video, an image, and so on. These kinds of external data do not all need the same processing before they can be used properly. For instance, an embedded JPEG image needs a JPEG decompression algorithm before it can be displayed nicely.

So streams support several kinds of filters, which aim at "transforming" the data into a form in which they can be used: ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode, CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode.
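To get a feel for what these filters do, two of them can be imitated with the Ruby standard library alone. This is only a sketch and not origami code: the helper names ascii_hex_encode and flate_encode are made up for the illustration, and they mimic what a PDF writer does before storing data behind the matching filter.

```ruby
require "zlib"

# Hypothetical helpers (not part of origami).

# Counterpart of ASCIIHexDecode: every byte becomes two hex digits,
# and '>' marks the end of data.
def ascii_hex_encode(data)
  data.unpack1("H*").upcase + ">"
end

# Counterpart of FlateDecode: zlib/deflate compression.
def flate_encode(data)
  Zlib::Deflate.deflate(data)
end

raw = "Hello world " * 50

hex   = ascii_hex_encode(raw)
flate = flate_encode(raw)

puts "raw:   #{raw.bytesize} bytes"
puts "hex:   #{hex.bytesize} bytes"   # twice the input, plus the '>' marker
puts "flate: #{flate.bytesize} bytes" # smaller, since the input is repetitive
```

The ASCII filters make binary data printable at the cost of size, while the compression filters shrink it. This is why the two kinds are often combined.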

Raw data in streams

Let us create a first PDF file containing raw data:

pdf = PDF.new
contents = ContentStream.new
contents.write "Hello world",
  :x => 350, :y => 750, :rendering => PS::Text::Rendering::STROKE, :size => 30
page = Page.new
page.Contents = contents
pdf.append_page(page)
pdf.saveas("out.pdf")

We have just created a ContentStream designed to write text on a page. If we look in the resulting file, we can read the text in the clear:

4 0 obj
<<
        /Length 55
>>stream
BT
/F1 30 Tf 350 750 Td 20 TL
1 Tr (Hello world) Tj
ET
endstream
endobj

Setting a filter on streams

Now, let us change a single line:

-contents = ContentStream.new
+contents = ContentStream.new( "", :Filter => :ASCIIHexDecode )

When calling the constructor for the stream, we set an empty string (text will be added later), and select an ASCIIHexDecode filter. The PDF file now contains:

4 0 obj
<<
        /Length 111
        /Filter /ASCIIHexDecode
>>stream
42540A2F4631203330205466203335302037353020546420323020544C0A31205472202848656C6C6F20776F726C642920546A200A4554>
endstream
endobj
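To check that nothing was lost, the stream data can be decoded by hand. The sketch below uses plain Ruby and no origami call at all; ascii_hex_decode is a hypothetical helper written for this example.

```ruby
# Hypothetical helper (not part of origami): undo the ASCIIHexDecode encoding.
# Whitespace is ignored and '>' marks the end of data.
def ascii_hex_decode(data)
  hex = data.delete(" \t\r\n").sub(/>.*\z/m, "")
  [hex].pack("H*")
end

# Stream data copied from object 4 above.
encoded = "42540A2F4631203330205466203335302037353020546420323020544C" \
          "0A31205472202848656C6C6F20776F726C642920546A200A4554>"

decoded = ascii_hex_decode(encoded)
puts decoded
puts "#{decoded.bytesize} bytes"
```

The decoded text is the same 55-byte content stream as in the raw version. Only its representation changed, which is why /Length went from 55 to 111 (110 hex digits plus the > marker).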

For the programmers, you can also call the method setFilter to get the same result:

contents = ContentStream.new.setFilter( :ASCIIHexDecode )

Chaining filters

Simple: instead of providing a single filter, set several of them at once with an array:

contents = ContentStream.new.setFilter( [ :ASCIIHexDecode, :LZWDecode, :ASCII85Decode ] )

The resulting stream is double-encoded and compressed:

4 0 obj
<<
        /Filter [ /ASCIIHexDecode /LZWDecode /ASCII85Decode ]
        /Length 163
>>stream
800D878221D1186E502888C8230241847C2C158F8B26C178AC7E29178CC5642191ACDE311609CD04D1C918784B398F05873150C0703731954A4442E9A86E4222988DE381C46C66283A8907452371F87D01>
endstream
endobj

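Important point: when the stream is encoded, the filters are applied starting from the end of the array back to the beginning. Here the data goes through ASCII85 first, then LZW, then ASCIIHex, which is why the stream ends up as hexadecimal text. A reader decodes it by applying the filters in the order they are listed.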
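The ordering rule can be tried out with a small round trip. The sketch below is not origami's implementation: the ENCODERS and DECODERS tables and the two helper methods are hypothetical, and they only cover two filters that the Ruby standard library can handle.

```ruby
require "zlib"

# Hypothetical encoder/decoder tables (not origami's API).
ENCODERS = {
  ASCIIHexDecode: ->(d) { d.unpack1("H*").upcase + ">" },
  FlateDecode:    ->(d) { Zlib::Deflate.deflate(d) }
}.freeze

DECODERS = {
  ASCIIHexDecode: ->(d) { [d.delete(" \t\r\n").sub(/>.*\z/m, "")].pack("H*") },
  FlateDecode:    ->(d) { Zlib::Inflate.inflate(d) }
}.freeze

# Writer side: walk the /Filter array from the end back to the beginning.
def encode_stream(data, filters)
  filters.reverse.reduce(data) { |acc, name| ENCODERS.fetch(name).call(acc) }
end

# Reader side: apply the filters in the order they are listed.
def decode_stream(data, filters)
  filters.reduce(data) { |acc, name| DECODERS.fetch(name).call(acc) }
end

filters = [:ASCIIHexDecode, :FlateDecode]
plain   = "BT /F1 30 Tf (Hello world) Tj ET"

encoded = encode_stream(plain, filters)
puts encoded                          # hex digits only: hex was applied last
puts decode_stream(encoded, filters)  # back to the original text
```

Because the first filter in the array is the last one applied when encoding, it is also the one whose output you actually see in the file.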

Encoding JavaScript

Ok, but concealing text is not that interesting. True! But I wrote at the beginning of this article that many things are represented as streams ... including JavaScript.

pdf = PDF.read( ARGV[0] )
jscript = "app.alert('hello world!');"
jsaction = Action::JavaScript.new(
                   Stream.new(jscript,
                              :Filter =>[:ASCIIHexDecode, :FlateDecode]))
pdf.onDocumentOpen(jsaction)
pdf.saveas("out.pdf")

When the generated PDF file is opened, a pop-up window says "hello world" as expected. But when you look in the PDF file:

6 0 obj
<<
        /Length 69
        /Filter [ /ASCIIHexDecode /FlateDecode ]
>>stream
78DA4B2C28D04BCC492D2AD150CF48CDC9C95728CF2FCA495154D7B406007F4808DF>
endstream
endobj
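Anyone who knows the filter chain can of course undo it. As a sketch using only the Ruby standard library (no origami involved), the stream above decodes like this:

```ruby
require "zlib"

# Stream data copied from object 6 above; '>' is the ASCIIHexDecode end marker.
encoded = "78DA4B2C28D04BCC492D2AD150CF48CDC9C95728CF2FCA495154D7B406007F4808DF>"

# /Filter [ /ASCIIHexDecode /FlateDecode ]: undo the hex layer first,
# then inflate the zlib data it contains.
compressed = [encoded.delete(">")].pack("H*")
script     = Zlib::Inflate.inflate(compressed)

puts script
```

So the filters conceal the script from a casual look at the file, but they do not protect it.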

Even if it does not look like JavaScript, it is!

The final countdown: ciphering

Last but not least, you can cipher a PDF file. I will not go into the details now, as ciphering is not a filter: it just makes things a bit more painful to read. First, you must remember that only strings and the content of streams are encrypted. Second, up to encryption mode 4 (out of 5), one can set an empty password on a file. So, based on these two facts, we slightly change the previous origami script:

pdf.onDocumentOpen(jsaction)
+pdf.encrypt("", "", :Algorithm => :AES) # Empty User and Owner passwords
pdf.saveas("enc.pdf")

We add a single line for the following result:

6 0 obj
<<
        /Filter [ /ASCIIHexDecode /FlateDecode ]
        /Length 96
>>stream
~G0=ÆÂ~A.ýo^KÎM©#~UF$^@~Y~_+cúµR~NpîžË^`ï°(œÇõü>T¶WqZ""Ê`Mgv^CÏpü^R6^T~O^D^PÒ[@©Œxa^^TÜÉ~JŒ^R­ÈöÆZ*ÍåEe~^R]cèþJê2X
endstream
endobj

And the window still pops up saying "hello world".

Help needed

Currently, origami does not support all the filters:

  • Supported: ASCIIHexDecode, ASCII85Decode, LZWDecode, FlateDecode, RunLengthDecode
  • TODO: CCITTFaxDecode, JBIG2Decode, DCTDecode, JPXDecode

Feel free to send us your patch ;)