Chapter 2. Overview

Portable Document Format (PDF) is a file format created by Adobe Systems for document exchange. Each PDF file encapsulates a complete description of a fixed-layout 2D document, including the text, fonts, images, and 2D vector graphics that compose it. PDF is an ISO standard (ISO 32000-1:2008), published under the title Document management -- Portable document format -- Part 1: PDF 1.7.

PDFLeo allows users to post-process PDF files generated by other programs, such as PDF printers, word processors (for example Word 2007 and OpenOffice Writer), and libraries such as FOP and iText. Many of these programs lack certain features. Some do not provide encryption. Some produce unnecessarily large files. Some leave metadata fields empty. Most do not produce web-optimized files. These shortcomings can be fixed with pdfleo.

2.1. Product Features

PDFLeo supports the following kinds of PDF processing:

  • Encryption. Encrypt a PDF document with either password security or public key security. Remove PDF encryption if you are able to open the file. Retain the encryption and permission settings in the new PDF while other parts of the document are modified.

  • Linearization. Optimize PDF documents for viewing over a slow connection from a capable web server.

  • Size Optimization. Reduce the file size by removing redundant content, compressing streams and moving objects into object streams.

  • Query document information, such as metadata, security, document permissions and font information.

  • Insert and modify predefined or custom document information entries.

  • Insert, view and modify XMP metadata.

2.2. Technical View

A PDF file consists primarily of objects, of which there are eight types:

  • Boolean values, representing true or false

  • Numbers

  • Strings

  • Names

  • Arrays, ordered collections of objects

  • Dictionaries, collections of objects indexed by Names

  • Streams, usually containing large amounts of data

  • The null object

Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref table gives the byte offset of each indirect object from the start of the file. This design allows for efficient random access to the objects in the file, and also allows for small changes to be made without rewriting the entire file (incremental update).

Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique reduces the size of files that have large numbers of small indirect objects.
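The role of the xref table can be illustrated with a short Python sketch. It is not part of pdfleo: the helper names build_minimal_pdf and read_xref are made up for this illustration, the generated file is only a skeleton, and the parser handles a single classic (non-stream) xref section only.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a tiny PDF skeleton, recording the byte offset of each object."""
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for obj in objects:
        offsets.append(len(out))  # offset from the start of the file
        out += obj
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for off in offsets:
        out += b"%010d %05d n \n" % (off, 0)
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def read_xref(pdf: bytes) -> dict:
    """Return {object_number: (byte_offset, generation)} for in-use entries."""
    # The last startxref keyword in the file points at the xref section.
    marker = pdf.rfind(b"startxref")
    if marker < 0:
        raise ValueError("startxref keyword not found")
    xref_pos = int(pdf[marker + len(b"startxref"):].split()[0])
    lines = pdf[xref_pos:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("no classic xref table at the recorded offset")
    table = {}
    i = 1
    while i < len(lines) and not lines[i].startswith(b"trailer"):
        # Subsection header: first object number and number of entries.
        first, count = (int(x) for x in lines[i].split())
        for k in range(count):
            offset, generation, kind = lines[i + 1 + k].split()
            if kind == b"n":  # "n" marks an in-use object, "f" a free one
                table[first + k] = (int(offset), int(generation))
        i += 1 + count
    return table
```

With the table in hand, a reader can jump straight to any object, for example `pdf[table[2][0]:]` begins with `2 0 obj`, without scanning the rest of the file.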

2.2.1. Layouts

PDF files come in two layouts: normal (non-linearized) and linearized. Non-linearized PDF files consume less disk space than their linearized counterparts, but they are slower to access because the data required to assemble a page is scattered throughout the file. Linearized PDF files (also called "optimized" or "web optimized" PDF files) are written to disk in page order, which enables a Web browser plugin to display them without waiting for the entire file to download.

2.2.2. Metadata

Metadata can be specified in two ways: the "native" info dictionary and the metadata stream. The info dictionary method has several shortcomings. Indexing software cannot locate the metadata unless it understands the PDF format. Adobe published only a few standard metadata entries, and there is no standard for the names and contents of other entries. Each entry can hold only a single string. Finally, if the file is encrypted, the indexing software needs the key to access the entries.

The PDF metadata stream stores metadata in a standard format called XMP. XMP is serialized using the W3C RDF format (which is based on XML) and can be embedded in a variety of multimedia file formats, such as JPEG. When stored as plain text in the file, XMP is wrapped in a special header that identifies it as XMP and indicates the encoding. Indexing software can therefore search the file for this marker and load the content without understanding the PDF file format.
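The marker-based search can be sketched in a few lines of Python. This is an illustration only, not how pdfleo or any particular indexer is implemented; the function name find_xmp_packet is hypothetical, and the scan only works when the metadata stream is stored uncompressed and unencrypted.

```python
from typing import Optional


def find_xmp_packet(data: bytes) -> Optional[bytes]:
    """Return the first XMP packet found in raw file bytes, or None."""
    # An XMP packet starts and ends with xpacket processing instructions.
    start = data.find(b"<?xpacket begin=")
    if start < 0:
        return None
    trailer = data.find(b"<?xpacket end=", start)
    if trailer < 0:
        return None
    end = data.find(b"?>", trailer)
    if end < 0:
        return None
    return data[start:end + 2]


# A made-up fragment standing in for the bytes of a PDF file.
sample = (
    b"5 0 obj\n<< /Type /Metadata /Subtype /XML >>\nstream\n"
    b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>\n'
    b'<?xpacket end="w"?>\nendstream\nendobj\n'
)
packet = find_xmp_packet(sample)
```

Note that the scan never parses PDF syntax: it treats the file as an opaque byte string, which is exactly what makes XMP readable by format-agnostic tools.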

PDFLeo allows users to insert XMP metadata into existing PDF files and to modify existing info dictionary entries. When the info dictionary is modified, the change is synchronized to the XMP metadata.

2.2.3. Encryption

PDF allows all string and stream objects to be encrypted to prevent unauthorized people from accessing the content (the metadata stream may optionally be left in clear text). The encryption method can be either RC4 (with a key length from 40 to 128 bits) or AES (128 bits). Before the content can be decrypted, a master key must be obtained. Calculating the master key is the job of the security handler. The PDF standard defines two kinds of security handlers, and pdfleo supports both.

  • Password security handler. Called the standard security handler in the PDF specification, this handler allows the author to specify two passwords: the user password and the owner password. Using data stored in the PDF file, the user password can be calculated once the owner password is known. The user password is then used to obtain the master key.

  • Public key security handler. This handler uses the X.509 public key certificate of each recipient to encrypt the master key. Multiple recipients can be specified in this manner. A recipient supplies the matching private key when viewing the file. No password exchange is required; the creator only needs the public certificate of each recipient.

A PDF may use an undocumented security handler or rely on a live server for authentication. Like most other PDF programs, pdfleo cannot decrypt such files.

2.3. Supplying Credentials

If the source document is encrypted, credentials are required to gain access to the document. The PDF standard defines two types of security: password and public key. Other types of security are possible, but they are not standardized and therefore cannot be read by pdfleo.

2.3.1. Specifying Passwords

Password security allows two passwords: the owner password and the user password. Users with the owner password have unrestricted privileges to the document, while those with the user password have limited permissions defined by the author. Note that these permissions are enforced only by the viewing application. A user who knows the user password effectively has unrestricted access to the document, given an application that ignores the restrictions. For example, a PDF document with an empty user password can have its encryption settings removed with pdfleo.

Passwords are specified through the --password switch. An empty password does not need to be specified, because pdfleo tries the empty password if all other attempts fail.

Multiple passwords, up to 10, can be specified on the command line. pdfleo tries them in the order specified. Because pdfleo does not honor the permission flags, specifying either password works in most cases.

C:\>pdfleo --password=oooo134 --password=sjusl \
    source.pdf target.pdf
    

Some actions, such as copying the owner password to the new file, require the file to be opened with the owner password. Therefore, if both passwords are known, the owner password should be placed before the user password.
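The trial order described in this section can be sketched in Python as follows. This is only a model of the documented behavior, not pdfleo's code: first_working_password is a hypothetical name, and try_open stands for whatever routine actually attempts to open the document with a given password.

```python
from typing import Callable, Iterable, Optional

MAX_PASSWORDS = 10  # limit stated in this manual


def first_working_password(
    candidates: Iterable[str], try_open: Callable[[str], bool]
) -> Optional[str]:
    """Try each password in order, then the empty password as a last resort."""
    ordered = list(candidates)
    if len(ordered) > MAX_PASSWORDS:
        raise ValueError("at most %d passwords may be supplied" % MAX_PASSWORDS)
    # The empty password is appended so that it is tried only after all
    # explicitly supplied passwords have failed.
    for password in ordered + [""]:
        if try_open(password):
            return password
    return None
```

Because the first match wins, listing the owner password first ensures the document is opened with owner rights whenever both passwords are valid.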

2.3.2. Specifying digital ID file

Public key encrypted PDF files require a private key file and, if the key file is protected, the password to access it. The two parts are passed to the --digital-id switch, separated by a colon (:). The private key file must be in the standard PKCS#12 format and usually has the extension pvk or p12. For example, the following command identifies the private key file c:\company.pvk, with the password mypass.

C:\>pdfleo --digital-id=c:\company.pvk:mypass source.pdf target.pdf
  

Multiple --digital-id switches, up to 10, can appear on the command line. PDFLeo tries them in the order in which they are specified.
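Because a Windows path such as c:\company.pvk itself contains a colon, a script that prepares or inspects --digital-id values has to decide which colon is the separator. The Python sketch below shows one plausible rule (split at the last colon, unless that colon is the one following a drive letter). How pdfleo itself parses the value is not documented here, so the rule and the function name split_digital_id are assumptions.

```python
from typing import Tuple


def split_digital_id(value: str) -> Tuple[str, str]:
    """Split 'keyfile:password' into (keyfile, password).

    Assumption: the password is everything after the LAST colon, and a
    lone drive-letter colon (as in 'c:\\key.p12') is not a separator.
    """
    path, separator, password = value.rpartition(":")
    # No colon at all, or the only colon belongs to a drive letter.
    if not separator or (len(path) == 1 and path.isalpha()):
        return value, ""
    return path, password
```

The rule is ambiguous for a key file with a one-letter name or a password that contains a colon; such cases would need a convention that this manual does not define.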

2.4. Encryption, Decryption and Permissions

The detailed information on how to encrypt, decrypt or change permissions is covered in Chapter 3, PDF Security. This section gives a quick summary and some examples.

2.4.1. Querying Security Settings

Security settings are displayed using the --info switch.

C:\>pdfleo -i test2.pdf 
...
=============== Document Security ==============================
Security Method: Password Security (1)
  Authorized by: User Password     (2)
          Print: None              (3)
  Accessibility: No
         Modify: Assembly
Encryption Level: AES (128-bit)    (4)
... 

(1) Document is encrypted with passwords.
(2) Document is opened using the user password (which is empty).
(3) Print permission: no printing allowed.
(4) Encryption method is 128-bit AES.

For public key encrypted documents, the program also prints the list of recipients:

C:\>pdfleo --digital-id=tester.pfx:1234demo pubkey.pdf
...
================ Document Security ==============================
Security Method: Certificate Security
          Print: High Resolution
  Accessibility: No
         Modify: Assembly
 Clear Metadata: No
Encryption Level: AES (128-bit)

List of Recipients:
C=US, O=Equifax, OU=Equifax Secure Certificate Authority (1)
CN=Test Group, O=Morovia, OU=Morovia/emailAddress=tester@morovia.com, C=CA (2)
...

(1) Recipient #1: Equifax Certificate Authority
(2) Recipient #2: Test Group (tester@morovia.com)

2.4.2. Encrypting Document

PDF documents can be encrypted with password security or public key security. A number of parameters can be specified, such as the encryption method, the key length and so on.

C:\>pdfleo --encrypt=new;AES-128;;yes \                (1)
   --password-security=demo;;print=high;modify=none    (2)

(1) Document is encrypted with 128-bit AES. Metadata is not encrypted.
(2) Document is secured with passwords: the owner password is demo and the user password is empty. Permissions: high-resolution printing and no modification.

2.4.3. Removing Encryption

Encryption is removed by specifying --encrypt=discard without additional parameters.

C:\>pdfleo --password=Monster --encrypt=discard \
     source.pdf target.pdf
     

2.4.4. Preserving Encryption

By default, the encryption is preserved in the output PDF, and all security settings remain intact. For example, the command below creates a linearized PDF file with the encryption copied from the source PDF:

C:\>pdfleo --password=Monster \
    --linearize source.pdf target.pdf
    

The preserve mode can be specified explicitly through the --encrypt=preserve parameter. The command below achieves the same result:

C:\>pdfleo --password=Monster --encrypt=preserve \
    --linearize source.pdf target.pdf
    

2.5. Web-Optimized (Linearized) PDF

A non-linearized PDF file must be fully downloaded to the client computer before a viewer can display its pages. Linearization (called Fast Web View in Acrobat and Adobe Reader) rearranges the PDF so that a capable viewer, after downloading only a few kilobytes, can work out which byte range it needs in order to display the requested page. It then asks the web server for just the bytes within that range. Byte serving is defined as part of the HTTP/1.1 protocol and is supported by most web servers.

In order to take advantage of this feature, you need:

  • A web server that supports byte serving. Because byte serving is part of HTTP/1.1, most current web servers are already capable.

  • A viewer that supports this feature. In Acrobat or Adobe Reader, make sure that the option Allow fast web view is checked. This option is enabled by default.

  • A linearized PDF file, which can be produced by pdfleo.
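The byte-serving requirement can be made concrete with a small Python sketch. It uses only the standard HTTP header names Accept-Ranges and Range; the two function names are made up for this illustration, and no network access is performed (the headers would come from whatever HTTP client you use).

```python
def supports_byte_serving(headers: dict) -> bool:
    """True if response headers advertise support for byte-range requests."""
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    tokens = normalized.get("accept-ranges", "").split(",")
    return "bytes" in [token.strip().lower() for token in tokens]


def range_header(first: int, last: int) -> dict:
    """Build the request header asking for bytes first..last (inclusive)."""
    if first < 0 or last < first:
        raise ValueError("invalid byte range")
    return {"Range": "bytes=%d-%d" % (first, last)}
```

A viewer would typically send something like `range_header(0, 1023)` first, read the linearization data from that small chunk, and then request only the ranges needed for the page being displayed. A server that honors the request answers with status 206 (Partial Content).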

Linearization provides a better user experience when serving large PDF files (measured in number of pages or file size in MB) over the web or other slow connections. Linearization generally increases the file size. In some cases the increase is significant, for example when the PDF file is large and many objects are compressed into object streams.

The linearization feature is applied through the --linearize switch. It can be used in conjunction with other transforms, such as encryption and compression.

C:\>pdfleo --linearize source.pdf target.pdf
  

You can verify the result using pdfleo --info, as illustrated below:

C:\>pdfleo --info target.pdf
  ...
Number of Pages: 30
     Tagged PDF: No
     Linearized: Yes (1)
      Page Size: 8.50x11.00 in
  ...   

(1) The document is linearized.

You can also verify it through Adobe Reader or Acrobat. Open the document and select File > Properties... (Ctrl+D). The Fast Web View entry shows Yes if the document is linearized.
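A rough check can also be scripted. The Python sketch below relies on two facts from the PDF specification: the linearization parameter dictionary (containing the /Linearized key) appears within the first 1024 bytes, and its /L entry must equal the exact file length. It is a heuristic and not necessarily the test that pdfleo or Acrobat performs; the function names and the sample bytes are made up for illustration.

```python
import re


def is_marked_linearized(data: bytes) -> bool:
    """Heuristic: /Linearized in the first 1024 bytes and /L == file length."""
    head = data[:1024]
    if re.search(rb"/Linearized\s+\d", head) is None:
        return False
    length = re.search(rb"/L\s+(\d+)", head)
    # A mismatch means the file was changed after it was linearized.
    return length is not None and int(length.group(1)) == len(data)


def make_sample() -> bytes:
    """Build a fake file header whose /L value matches its own length."""
    template = (
        b"%PDF-1.5\n1 0 obj\n"
        b"<< /Linearized 1 /L 0000000000 /O 3 /E 100 /N 1 /T 50 >>\n"
        b"endobj\n%%EOF\n"
    )
    # Same-width replacement keeps the total length unchanged.
    return template.replace(b"0000000000", b"%010d" % len(template))
```

Appending even a single byte to a linearized file (as an incremental update does) makes the /L test fail, which is why such files are no longer treated as linearized.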

Note

Linearization is not preserved during transformation unless the --linearize option is specified. If the source PDF is linearized but --linearize is not specified, the resulting PDF will not be linearized.

2.6. Optimization (Size Reduction)

Because of the way pdfleo reads the source document, some optimization techniques are always applied. Additional steps can be taken to further reduce the file size, such as compressing stream objects and placing objects into object streams. The techniques involved include:

  • Removing unused objects. Unused objects are discarded. A PDF produced through incremental updates often contains many objects that are no longer needed. Incremental update is a feature that allows a processing application to append changes at the end of the file without removing prior object definitions. This technique reduces memory usage at the cost of a bigger file.

  • Writing objects in a compact syntax. PDFLeo writes its output using a compact syntax. Extra white space is removed. Hexadecimal strings are written in the more compact binary representation.

  • Compressing streams. When requested, pdfleo compresses all streams except those that must be kept intact.

  • Object streams. Non-stream objects can be placed into a special object stream and compressed.

Optimization is controlled through two switches: --stream-data and --object-streams. The --stream-data option controls how individual streams are compressed. The --object-streams option controls whether object streams are generated.

PDFLeo does not apply any optimization step that could result in loss of information, such as image quality degradation or loss of fonts.

2.6.1. Stream Compression Options

The contents of stream objects can be encoded with various techniques, such as LZW compression or Flate compression. These techniques are referred to as filters. An application that produces a PDF file can encode certain information (for example, data for sampled images) to compress it or to convert it to a portable ASCII representation. An application that reads (consumes) the PDF file can then invoke the corresponding decoding filter to convert the information back to its original form.
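The encode/decode round trip of a filter can be demonstrated with Python's zlib module, which implements the same zlib/deflate format that the Flate filter uses. The function names and the sample content stream are made up for this illustration; this is not pdfleo code.

```python
import zlib


def flate_encode(raw: bytes) -> bytes:
    """What a PDF producer does before writing a Flate-filtered stream."""
    return zlib.compress(raw, 9)


def flate_decode(encoded: bytes) -> bytes:
    """What a PDF consumer does to recover the original stream data."""
    return zlib.decompress(encoded)


# A repetitive page content stream, typical of text-heavy pages.
content = b"BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET\n" * 50
encoded = flate_encode(content)
restored = flate_decode(encoded)
```

Flate is lossless, so restored is byte-for-byte identical to content, while encoded is much smaller for repetitive data such as drawing commands.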

Stream compression options are specified through the --stream-data switch, which takes one of three values: none, preserve or compress.

Table 2.1. Stream Compression Options

Value     Description
none      All streams are written uncompressed. This is useful if you want to look at the plain text content.
preserve  The state of each stream is preserved; for example, if the original stream is compressed, it is also compressed in the output PDF.
compress  Deflate compression is applied to all data streams, unless doing so could cause adverse effects.

PDFLeo understands many filters. For compression, however, it uses the Flate filter exclusively (the other general-purpose compression filter, LZW, does not offer better performance). PDFLeo preserves the stream data as-is when it encounters an unknown filter or a lossy filter such as DCTDecode. Streams that use a predictor are decompressed when required, but are left unchanged when compression is requested.

Metadata streams are handled differently. For metadata to be searchable, the stream should be in clear text (uncompressed and unencrypted). Therefore, PDFLeo does not compress metadata streams even when --stream-data=compress is specified. This applies when the PDF document is either unencrypted or encrypted with clear text metadata. If the metadata is encrypted, metadata streams are compressed as well.

The command below compresses data streams to reduce file size:

  
C:\>pdfleo --data-streams=compress source.pdf target.pdf
  

Occasionally some users may want the data streams in clear text, so that they can see the drawing commands in each page. The following command line produces such output.

C:\>pdfleo --data-streams=uncompress source.pdf target.pdf
  

2.6.2. Object Stream Options

PDF file size can be substantially reduced by placing uncompressed PDF objects into a stream and compressing that stream. This type of stream is called an object stream. The object stream option is specified through --object-streams and takes one of the three values below:

Table 2.2. Object Stream Options

Value     Description
preserve  Original object streams are preserved. If the source document does not contain object streams, none are generated.
disable   Object streams are removed. This often results in a larger PDF file.
generate  Object streams are generated whenever possible.

The following command compresses all data streams and places objects into object streams whenever possible, with the goal of minimizing the file size.

C:\>pdfleo --data-streams=compress \
   --object-streams=generate source.pdf target.pdf
  

The default option is preserve.
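To see why object streams save space, the Python sketch below packs a set of small objects into the payload of an object stream (a header of "object-number offset" pairs followed by the object bodies) and compresses it with zlib. It is a simplified model and not pdfleo's implementation: the function names and the sample objects are made up, and the stream dictionary and xref stream that a real file also needs are left out.

```python
import zlib
from typing import Dict, Tuple


def pack_object_stream(objects: Dict[int, bytes]) -> Tuple[int, bytes]:
    """Return (first, payload): header length and uncompressed stream data."""
    bodies = bytearray()
    pairs = []
    for number in sorted(objects):
        # Offsets are relative to the first object body, not to the stream.
        pairs.append(b"%d %d" % (number, len(bodies)))
        bodies += objects[number] + b"\n"
    header = b" ".join(pairs) + b"\n"
    return len(header), header + bytes(bodies)


def size_as_plain_objects(objects: Dict[int, bytes]) -> int:
    """Bytes needed when each object is written as 'n 0 obj ... endobj'."""
    return sum(
        len(b"%d 0 obj\n" % number) + len(body) + len(b"\nendobj\n")
        for number, body in objects.items()
    )


# 200 small, similar dictionaries, e.g. link annotations.
objects = {
    n: b"<< /Type /Annot /Subtype /Link /Rect [0 0 %d %d] >>" % (n, n)
    for n in range(10, 210)
}
first, payload = pack_object_stream(objects)
compressed = zlib.compress(payload, 9)
```

Comparing `len(compressed)` with `size_as_plain_objects(objects)` shows the saving; the gain is largest for files with many small, similar objects, as noted earlier in this chapter.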

2.7. Querying Document Information

You can obtain various metadata of a PDF document by using the option --info, or its abbreviation -i. The output first lists all information dictionary entries, followed by document security attributes such as the security method and permissions. The last section lists all the fonts required by the document (either embedded or required to be present on the system).

Below is the sample output:

C:\>pdfleo --info Brother_HL_4050_CDN_Manual.pdf
Morovia (R) pdfleo 32-bit Professional Version 1.0
           File: Brother_HL_4050_CDN_Manual.pdf
          Title: HL4040CN_HL4050CDN_HL4070CDW.book (1)
         Author: ZZPZ3635
        Subject: N/A
       Keywords: N/A
        Created: 06/29/2007 10:38:30 AM
       Modified: 06/29/2007 04:05:36 PM
    Application: FrameMaker 7.0
   PDF Producer: Acrobat Distiller 6.0 (Windows)
    PDF Version: 1.5 (Acrobat 6.x)
Number of Pages: 211
     Tagged PDF: No
     Linearized: Yes
      Page Size: 8.50x11.00 in
================ Document Security ==============================
Security Method: Password Security               (2)
  Authorized by: User Password
          Print: Allowed
         Modify: Not Allowed
        Extract: Allowed
       Annotate: Not Allowed
Encryption Level: RC4 (40-bit)

================ Fonts Info =====================================
Font Name                                Encoding     Type
---------------------------------------- ------------ ------------
TT9A1o00(embedded subset)                WinAnsi      Type 1C (3)
TT9A2o00(embedded subset)                WinAnsi      Type 1C
TT9A3o00(embedded subset)                WinAnsi      Type 1C
TT9A4o00(embedded subset)                WinAnsi      Type 1C
TT9A5o00(embedded subset)                WinAnsi      Type 1C
 ... (remaining skipped)

(1) Document metadata
(2) Security properties
(3) List of document fonts

2.8. Inserting and Modifying Info Dictionary

PDF stores metadata entries in a special dictionary called the information dictionary. PDF defines several standard keys, and authors can define custom entries. The info dictionary is often called the native metadata. Because of the limitations of the dictionary, values in multiple languages are not supported.

Despite the efforts to push XMP adoption, many PDF programs still read the info dictionary for metadata. It is therefore necessary to keep the info dictionary and the XMP metadata synchronized.

PDFLeo allows users to insert, modify or remove entries in the information dictionary. Changes are synchronized to the XMP metadata.

The following command snippet changes the document title to PDFLeo User Manual and the document producer to Morovia PDF Writer. If a specified key does not exist, it is added.

--info-dict="Title=PDFLeo User Manual;Producer=Morovia PDF Writer"
  

You can remove an entry by specifying an empty value. For example,

--info-dict="Title=;Producer=Morovia PDF Writer"
  

removes the Title entry.
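When the entries come from a script, the switch value can be assembled programmatically. The Python sketch below builds the Key=Value;Key=Value form shown in this section. The function name info_dict_argument is hypothetical, and because this manual does not say how a semicolon or a double quote inside a value should be escaped, the sketch simply rejects such values rather than guessing.

```python
def info_dict_argument(entries: dict) -> str:
    """Build an --info-dict switch from {key: value}; '' removes the key."""
    parts = []
    for key, value in entries.items():
        # Escaping rules are not documented, so refuse ambiguous input.
        if not key or any(ch in key for ch in '=;"'):
            raise ValueError("unsupported character in key: %r" % key)
        if any(ch in value for ch in ';"'):
            raise ValueError("unsupported character in value: %r" % value)
        parts.append("%s=%s" % (key, value))
    return '--info-dict="%s"' % ";".join(parts)
```

For example, `info_dict_argument({"Title": "", "Producer": "Morovia PDF Writer"})` yields the same switch as the removal example in this section.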

2.8.1. Standard Information Dictionary Entries

The standard entries in the dictionary are listed below:

Table 2.3. Entries in the document information dictionary

Key           Value
Title         The document's title.
Author        The name of the person who created the document.
Subject       The subject of the document.
Keywords      Keywords associated with the document.
Creator       The name of the application that created the original document (such as Microsoft Word).
Producer      The name of the application that converted the original document into the PDF format (such as a PDF printer).
CreationDate  The date and time the document was created.
ModDate       The date and time the document was most recently modified.
Trapped       A value indicating whether the document has been modified to include trapping information. Valid values are True, False and Unknown.

Note that the keys are case sensitive. For entries requiring a Date (such as ModDate and CreationDate), the value must conform to the date format defined in the PDF standard, which you can find in Section A.1, “PDF Date Format”.
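As an illustration of the date format, the Python sketch below converts a datetime into a PDF date string of the form D:YYYYMMDDHHmmSSOHH'mm' as defined in ISO 32000-1. Section A.1 is not reproduced in this chapter, so treat the sketch as a reading of the ISO text rather than of this manual; the function name pdf_date is made up.

```python
from datetime import datetime, timedelta, timezone


def pdf_date(moment: datetime) -> str:
    """Format a datetime as a PDF date string (PDF 1.7 style)."""
    base = moment.strftime("D:%Y%m%d%H%M%S")
    offset = moment.utcoffset()
    if offset is None:
        return base  # naive datetime: time zone information is omitted
    minutes = int(offset.total_seconds() // 60)
    if minutes == 0:
        return base + "Z"  # local time equals UT
    sign = "+" if minutes > 0 else "-"
    hours, mins = divmod(abs(minutes), 60)
    # The apostrophes after the hour and minute offsets are part of the syntax.
    return "%s%s%02d'%02d'" % (base, sign, hours, mins)
```

For a document created on June 29, 2007 at 10:38:30 in a UTC-8 zone, the function returns D:20070629103830-08'00', which could then be used as the value of CreationDate or ModDate.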

2.9. XMP Metadata Processing

XMP (Extensible Metadata Platform) is an XML framework with many predefined properties. As the name implies, XMP can also be extended to satisfy specific requirements using custom extension schemas. XMP is much more powerful than the document information dictionary and is, for example, required by the PDF/A standard. Many industry groups have published standards based on XMP for vertical applications such as digital imaging or prepress data exchange.

With pdfleo you can extract document-level XMP metadata, and replace or merge existing metadata with the contents of an external file.

2.9.1. Extracting XMP

XMP can be extracted through the --dump-xmp switch. The dumped content is encoded in UTF-8.

The figure below shows the result of running pdfleo with the --dump-xmp option on xmpguide.pdf. Note that the metadata contains five schemas: the PDF schema (http://ns.adobe.com/pdf/1.3/), the XMP Basic schema (http://ns.adobe.com/xap/1.0/), the Dublin Core schema (http://purl.org/dc/elements/1.1/), the XMP PDFX schema (http://ns.adobe.com/pdfx/1.3/) and the XMP Media Management schema (http://ns.adobe.com/xap/1.0/mm/).

PDFLeo provides another switch, --dump-xmp-text, to display the metadata in a more readable format:

c:\>pdfleo --dump-xmp-text xmpspec3.pdf 
dumping metadata:Dumping XMPMeta object ""  (0x0)

   pdf:  http://ns.adobe.com/pdf/1.3/  (0x80000000 : schema)
      pdf:Copyright = "Copyright 2010, Adobe Systems ...
      pdf:Marked = "True"
      pdf:Producer = "Acrobat Distiller 8.1.0 (Windows)"
      pdf:Keywords = "XMP metadata  EXIF TIFF IPTC PSIR"
      ... 

2.9.2. Replacing XMP metadata

You can replace the existing XMP metadata with the contents of an external file by using the --replace-xmp switch. The file must be a valid XML file encoded in UTF-8.

The following example replaces the metadata with the contents of a file named ticks.xmp:

C:\>pdfleo --replace-xmp=ticks.xmp source.pdf output.pdf
  

Make sure that the content of the file is correct and that the file is encoded in UTF-8. Note also that the new metadata is not synchronized to the info dictionary.
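Both requirements (UTF-8 encoding and well-formed XML) can be checked before the file is handed to pdfleo. The Python sketch below uses only the standard library; the function names are made up, and the check says nothing about whether the content is valid XMP, only that it decodes and parses.

```python
import xml.etree.ElementTree as ET
from typing import List


def check_xmp_bytes(raw: bytes) -> List[str]:
    """Return a list of problems found; an empty list means the file looks OK."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return ["not valid UTF-8: %s" % exc]
    try:
        ET.fromstring(raw)
    except ET.ParseError as exc:
        return ["not well-formed XML: %s" % exc]
    return []


def check_xmp_file(path: str) -> List[str]:
    """Convenience wrapper that reads the file from disk first."""
    with open(path, "rb") as handle:
        return check_xmp_bytes(handle.read())
```

Running `check_xmp_file("ticks.xmp")` before the command above catches encoding and syntax mistakes early, which matters because the replacement is written into the PDF as-is.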

2.9.3. Merging XMP metadata

PDFLeo provides another switch, --merge-xmp, which merges the contents of an external file with the existing metadata. New properties are added, and existing properties are replaced.

The following example updates the metadata with the contents of a file named ticks.xmp:

C:\>pdfleo --merge-xmp=ticks.xmp source.pdf output.pdf
  

Make sure that the content of the file is correct and that the file is encoded in UTF-8. Note also that the merged metadata is not synchronized to the info dictionary.
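The difference between merging and replacing can be modeled with plain dictionaries. The Python sketch below treats metadata as a flat mapping from property name to value, which is a simplification (real XMP properties can be structured), and the property-level granularity of the merge is an assumption drawn from the description above. The function names are made up.

```python
def merge_properties(existing: dict, incoming: dict) -> dict:
    """Model of --merge-xmp: add new properties, overwrite matching ones."""
    merged = dict(existing)   # properties absent from the file are kept
    merged.update(incoming)   # new ones are added, old ones are replaced
    return merged


def replace_properties(existing: dict, incoming: dict) -> dict:
    """Model of --replace-xmp: the old metadata is discarded entirely."""
    return dict(incoming)


old = {"dc:title": "Draft", "pdf:Producer": "Acrobat Distiller 8.1.0 (Windows)"}
new = {"dc:title": "Final", "xmp:Rating": "5"}
merged = merge_properties(old, new)
replaced = replace_properties(old, new)
```

After the merge, pdf:Producer survives from the original metadata, whereas after a replace only the two properties from the external file remain.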

This manual is specific to PDFLeo 1.0.