New technique detects tampering or forgery of a PDF document

Researchers from the University of Pretoria presented a new technique for detecting tampering in PDF documents by analyzing the file’s page objects. A prototype implementing the technique can detect changes to a PDF document, including changes to its text, images, or metadata.

Because the PDF format is used as a formal means of communication in many industries, it has become an attractive target for criminals who want to alter contracts or spread misinformation.

With technology improving and easy access to tools like Adobe Acrobat and many free online editors, it’s now simple to change PDF files even without much knowledge of the format. This makes it important to detect and analyze any changes made to PDF documents.

The challenge with PDFs

Most current techniques for detecting changes in PDFs rely on watermarking and hashing. Watermarking involves embedding hidden marks in the document, while hashing generates unique codes based on the file’s content.

Although these methods can detect changes to visible parts of a PDF, such as text and images, they generally do not analyze hidden elements like metadata or background data.

Because these techniques focus on visible content, they cannot detect alterations that embed malware via PDF scripting features. Similarly, changes to PDF digital signatures often go unnoticed, which poses serious risks.

While watermarking and hashing can indicate whether a document has been altered, they typically cannot identify exactly where or what was changed. This is because even a small edit creates a different hash, making it difficult to pinpoint the specific modification.
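The limitation is easy to reproduce. The following is a minimal sketch, not taken from the researchers' work: it hashes two versions of a made-up byte string standing in for a file. The sample text and the `file_hash` helper are illustrative assumptions.

```python
import hashlib

# Made-up stand-in for a file's bytes (an assumption for illustration).
original = b"Pay the supplier 1,000 USD by 1 March. " * 20
# Change a single character in the first sentence.
tampered = original.replace(b"1,000", b"9,000", 1)


def file_hash(data: bytes) -> str:
    """Return one SHA-256 digest covering the whole input."""
    return hashlib.sha256(data).hexdigest()


h_original = file_hash(original)
h_tampered = file_hash(tampered)

# The digests differ, so the edit is detected...
changed = h_original != h_tampered
print("changed:", changed)
# ...but the two digests carry no information about where the edit is.
print(h_original)
print(h_tampered)
```

A single digest answers only "same or different", which is the gap the page-level approach described next tries to close.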

How the new prototype works

The new prototype detects tampering or forgery in PDF documents by working with their file page objects. Written in Python, it uses the hashlib, Merkly, and PDFRW libraries to generate hashes and access the PDF’s internal structures.

The PDFRW library was deliberately chosen because it provides lower-level access to PDF structures, which is beneficial for custom manipulation tasks and can offer speed advantages for handling large or complex PDFs.

The prototype performs two primary functions: protecting a PDF and assessing a PDF for forgery.

Protecting the PDF document

To enable future detection of changes, a PDF document must first be “protected” by running it through the prototype. The prototype reads the PDF document using the PDFRW library, which converts it into a dictionary-like object. It then isolates the file page object for each page in the document.

For each page, the system then proceeds to calculate unique digital fingerprints, known as hashes, from various elements. The content stream of each file page object, which describes how the page’s text, images, and graphics are displayed, is crucial here.

This content stream is systematically divided into small 256-byte pieces. From these pieces, a Merkle tree is constructed, yielding individual “leaf” hashes for each small section and a single “root” hash for the entire page’s content.
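The chunk-and-hash step can be sketched with the standard library alone. The prototype reportedly uses the Merkly library for its tree; the version below is hand-rolled with hashlib instead, so details such as the use of SHA-256, duplicating the last node on odd levels, and the function names are assumptions rather than the researchers' choices.

```python
import hashlib

CHUNK_SIZE = 256  # chunk size reported for the prototype


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def split_chunks(stream: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut the content stream into fixed-size pieces (the last may be shorter)."""
    return [stream[i:i + size] for i in range(0, len(stream), size)] or [b""]


def merkle_root(leaves: list[str]) -> str:
    """Pair up hashes level by level until a single root remains."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the odd node out
        level = [
            sha256_hex((level[i] + level[i + 1]).encode("ascii"))
            for i in range(0, len(level), 2)
        ]
    return level[0]


def hash_content_stream(stream: bytes) -> tuple[list[str], str]:
    """Return (leaf hashes, root hash) for one page's content stream."""
    leaves = [sha256_hex(chunk) for chunk in split_chunks(stream)]
    return leaves, merkle_root(leaves)


# Made-up content stream standing in for a real page.
stream = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET\n" * 100
leaves, root = hash_content_stream(stream)
print(len(leaves), "leaf hashes, root", root)
```

Keeping the leaf hashes alongside the root is what later allows a mismatch to be traced to one chunk instead of the whole page.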

Additionally, hashes are calculated for the file page object itself and the document’s overall metadata. During this process, some sub-objects of the file page object are excluded from hashing to ensure consistent results, as different PDF editors may not create and update PDF documents uniformly.

Once these hash values are computed, they are embedded as new, hidden keys directly in the relevant file page object (using keys like ‘hashobject’, ‘hashroot’, and ‘hashleafs’) and in the PDF’s main “root” object (using keys like ‘hashroot’ and ‘hashinfo’). This embedding creates a hidden record of the document’s original state against which later versions can be checked.
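As a rough illustration of the embedding step, the sketch below models a PDF as plain Python dictionaries instead of PDFRW objects. The key names come from the description above; everything else is an assumption: the document layout, the JSON-based `object_hash`, and `LastModified` as a hypothetical example of an excluded sub-object. The article does not say what the root-level ‘hashroot’ covers, so the sketch stores only ‘hashinfo’ there.

```python
import copy
import hashlib
import json

CHUNK_SIZE = 256
# Keys added by this sketch; they must never feed into their own hashes.
HASH_KEYS = {"hashobject", "hashroot", "hashleafs"}
# Hypothetical example of a sub-object an editor might rewrite on save.
EXCLUDED_KEYS = HASH_KEYS | {"LastModified"}


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def leaf_hashes(stream: bytes) -> list[str]:
    return [
        sha256_hex(stream[i:i + CHUNK_SIZE])
        for i in range(0, len(stream), CHUNK_SIZE)
    ] or [sha256_hex(b"")]


def merkle_root(leaves: list[str]) -> str:
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumption: duplicate the odd node out
        level = [
            sha256_hex((level[i] + level[i + 1]).encode("ascii"))
            for i in range(0, len(level), 2)
        ]
    return level[0]


def object_hash(obj: dict, excluded: set[str]) -> str:
    """Hash a dictionary in a stable key order, skipping excluded keys."""
    kept = {k: v for k, v in obj.items() if k not in excluded}
    return sha256_hex(json.dumps(kept, sort_keys=True).encode("utf-8"))


def protect(document: dict) -> dict:
    """Return a copy of the document with hash keys embedded."""
    protected = copy.deepcopy(document)
    for page in protected["pages"]:
        leaves = leaf_hashes(page["Contents"].encode("latin-1"))
        page["hashobject"] = object_hash(page, EXCLUDED_KEYS)
        page["hashleafs"] = leaves
        page["hashroot"] = merkle_root(leaves)
    protected["root"]["hashinfo"] = object_hash(protected["info"], set())
    return protected


# Dictionary stand-in for a two-page PDF (illustrative only).
document = {
    "root": {},
    "info": {"Title": "Contract", "Author": "A. Author"},
    "pages": [
        {"Contents": "BT (Page one) Tj ET", "MediaBox": [0, 0, 612, 792]},
        {"Contents": "BT (Page two) Tj ET", "MediaBox": [0, 0, 612, 792]},
    ],
}
protected = protect(document)
print(sorted(protected["pages"][0]))
```

Working on a deep copy mirrors the behavior described for the prototype, where protection produces a new file and leaves the input untouched.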

Finally, the prototype uses the PDFRW library to save a new PDF document that includes these hidden security marks, which then becomes the “protected” original for future checks. Note that the protection step does not modify the input: it creates a new file, a copy of the original with these security features added.

Checking for forgery

Secondly, to check a protected PDF document for any alterations, the same prototype is used. The system begins by reading the PDF and extracting the previously stored hidden hash values from both the root object and all the file page objects.

After extraction, these stored hash values are temporarily removed from their respective objects before the prototype generates a new set of hashes from the PDF’s current content.

The core of the detection lies in the comparison: these newly calculated hashes are then compared to the original stored hashes. If any discrepancy is found between the two sets of hashes, the prototype immediately signals that changes have been detected. A significant strength of this method is its ability to precisely locate the changes.

The system can inform you not only which page was altered, but also the exact 256-byte section within that page’s content where the change occurred. It can also specifically indicate if the document’s main metadata has been changed.
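Locating a change comes down to comparing stored leaf hashes with freshly computed ones, position by position. The sketch below shows that comparison on a made-up byte string; the function names and the use of SHA-256 are assumptions, and a real check would first read the stored hashes out of the protected PDF.

```python
import hashlib

CHUNK_SIZE = 256


def leaf_hashes(stream: bytes) -> list[str]:
    return [
        hashlib.sha256(stream[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(stream), CHUNK_SIZE)
    ]


def changed_chunks(stored_leaves: list[str], current_stream: bytes) -> list[int]:
    """Return the indices of chunks whose hash no longer matches."""
    current = leaf_hashes(current_stream)
    changed = []
    for i in range(max(len(stored_leaves), len(current))):
        old = stored_leaves[i] if i < len(stored_leaves) else None
        new = current[i] if i < len(current) else None
        if old != new:  # also flags chunks that were added or removed
            changed.append(i)
    return changed


# "Protect": remember the leaf hashes of the original stream.
original = b"A" * 1000
stored = leaf_hashes(original)

# Tamper with one byte at offset 600.
tampered = bytearray(original)
tampered[600] = ord("B")

result = changed_chunks(stored, bytes(tampered))
for index in result:
    start = index * CHUNK_SIZE
    print(f"chunk {index} changed (bytes {start}-{start + CHUNK_SIZE - 1})")
# prints: chunk 2 changed (bytes 512-767)
```

In this sketch, an edit that inserts or deletes bytes shifts every later chunk boundary, so all chunks after the edit point would be reported as changed; an in-place overwrite, as in the example, is reported as a single chunk.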

PDF tampering prototype works well with Adobe Acrobat

The prototype was primarily tested and confirmed effective when changes were made using Adobe Acrobat. While the prototype should theoretically detect changes regardless of the editor because protected PDFs are produced uniformly by the PDFRW library, this specific testing context is important to consider.

The current prototype does not yet detect every possible change to a PDF, such as altering a document’s font without changing the actual content, or adding JavaScript code. It also requires the PDF to be “protected” beforehand: it cannot assess an unprotected PDF because it needs the embedded hash values for comparison.
