PDF notes

PDF files for classes - factors to consider


Introduction

This is a compilation of information concerning PDF methods for course web sites. This information was gathered while supporting the BMB170 and BMB178 courses. Please send corrections or comments to: mathog A T caltech D o T edu.

Making smaller PDF files

When distributing PDF files electronically, all else being equal, it is preferable that they be as small as possible. While it is not possible in all cases to achieve size reduction, with a little care PDF files which were once many megabytes may be reduced to tens of kilobytes - with no loss of content. In fact, the graphics in the smaller files often look better than in the larger ones. The methods described here are specifically for this scenario and are not appropriate for PDFs intended for prepress or other applications where formatting is to be controlled to the pixel.

PDF Targets

Most PDF generators have target settings: default, screen, ebook, printer, prepress. Each of these sets default values for many of the PDF generation parameters. In general, if nothing else is changed, setting the target to /screen will result in the smallest PDF file and setting it to /prepress in the largest. (Unless the factors discussed in this document are taken into account, the resulting files are highly unlikely to be of equal display quality!) Unfortunately, in most PDF creation software the side effects of changing the target are not evident in the user interface. For instance, if the target is set to /screen in PDF Creator, the color image resolution is capped at 72 dpi, but the printer driver interface does not reflect this. This can be very confusing: one may set 300 dpi and print a PDF, then reduce the setting to 150 dpi and print again, only to find that both files are the same size. An image provided by primopdf.com (the original file has clickable links) summarizes these interactions.

Fonts - The basics

When a document is composed in an application like Microsoft Word or PowerPoint one is (almost) completely free to choose fonts. When the document is converted to a PDF file those fonts, or a subset comprising just the characters which were used, are embedded in that file. The benefit is that the PDF file can be viewed on a remote system exactly as it appeared on the originator's, even when the recipient's system has no copy of these fonts installed. The cost is size: embedding fonts makes the PDF file larger, and depending on which fonts are used it may be much larger. There are usually two parameters that control this process, called something like embed and subset. Their action is moderately complex, since some fonts will be subsetted even if subset is not set, other fonts may be embedded even if embed is not set, and the values indicated in the user interface may be silently overridden by the PDF target setting.

In any case, not all fonts must be embedded. The PDF specification requires that all compatible applications support the standard Type 1, or "base 14", fonts natively. These fonts are:

  Times (regular, italic, bold, and bold italic)
  Courier (regular, oblique, bold and bold oblique)
  Helvetica (regular, oblique, bold and bold oblique)
  Symbol
  Zapf Dingbats

Moreover, some of these fonts will be used when similar fonts were specified. Common substitutions are:

  Times for Times New Roman
  Helvetica for Arial
  Courier for Courier New
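
The two lists above can be turned into a simple name-level check. The following Python sketch is only an illustration: the function names are made up for this example, "Comic Sans MS" is an arbitrary stand-in for a non base 14 font, and the substitution table contains only the three mappings listed above.

```python
# Family names of the base 14 fonts, as listed above.
BASE14_FAMILIES = {"Times", "Courier", "Helvetica", "Symbol", "Zapf Dingbats"}

# Common substitutions listed above: source font -> base 14 family.
SUBSTITUTIONS = {
    "Times New Roman": "Times",
    "Arial": "Helvetica",
    "Courier New": "Courier",
}


def base14_family(font_name):
    """Return the base 14 family a font maps to, or None if there is none."""
    name = font_name.strip()
    if name in BASE14_FAMILIES:
        return name
    return SUBSTITUTIONS.get(name)


def fonts_needing_embedding(font_names):
    """Return the fonts with no base 14 equivalent in the lists above."""
    return sorted({f for f in font_names if base14_family(f) is None})


if __name__ == "__main__":
    # "Comic Sans MS" is a hypothetical example of a font that must be embedded.
    used = ["Times New Roman", "Arial", "Symbol", "Comic Sans MS"]
    print(fonts_needing_embedding(used))
```

A check like this only looks at font names. It cannot detect an application quietly switching fonts for individual characters.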

Example files were printed from MS Word 2003 through PDF Creator 1.2.0, PDF target is /printer, PDF 1.5, with font settings for embed and subset as indicated:

File           Size (bytes)  Type     Embed  Subset  Description
Example1.doc          19968  MS Word  NA     NA      text in 3 non base 14 fonts
Example1.pdf          29121  PDF      Y      Y       /printer setting, 3 non base 14 fonts
Example1b.pdf          5783  PDF      N      Y       /printer setting, 3 non base 14 fonts
Example2.doc          19968  MS Word  NA     NA      text in 3 base 14 fonts
Example2.pdf          16315  PDF      Y      Y       3 base 14 fonts
Example2b.pdf          4035  PDF      N      Y       3 base 14 fonts
Example2c.pdf        184058  PDF      Y      N       3 base 14 fonts
Example2d.pdf          4035  PDF      N      N       3 base 14 fonts

These examples illustrate that base 14 fonts are roughly as large as non base 14 fonts when embedded, that embedding a subset saves a lot of space over embedding the entire font, and that embedding no fonts produces the smallest PDFs. However, the key point is that when non-base 14 fonts are not embedded the resulting PDF is not portable. So Example1b.pdf will not necessarily display correctly everywhere, but Example2b.pdf will. Note also that there is a 46X difference in size between the smallest and largest PDF files which will display correctly - with no difference whatsoever in the image which it contains.
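
The size ratio quoted above can be reproduced from the table. This is just the arithmetic, written as a small Python sketch using the byte counts listed for the Example2 files.

```python
# File sizes in bytes, copied from the table above.
sizes = {
    "Example2.pdf": 16315,    # embed=Y, subset=Y
    "Example2b.pdf": 4035,    # embed=N, subset=Y
    "Example2c.pdf": 184058,  # embed=Y, subset=N
    "Example2d.pdf": 4035,    # embed=N, subset=N
}


def size_ratio(larger, smaller):
    """Ratio of two file sizes from the table."""
    return sizes[larger] / sizes[smaller]


# Largest versus smallest portable file: whole fonts embedded versus none.
print(round(size_ratio("Example2c.pdf", "Example2b.pdf"), 1))  # 45.6
```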

Fonts - The Gotchas

If only base 14 fonts are to be used, and none of these are to be embedded, one must be exceedingly careful that the application in use does not use an unintended font. If it does the resulting PDF file will not be portable. Here are some of the known instances where fonts are "out of control":

Application   Symbol(s)             Description
MS Word 2003  ≤, ≥                  Be sure to use Symbol font, characters 163 or 179. Any other font, even a base 14 font, will use Unicode characters U+2264 or U+2265, which are not in the base 14 font set.
MS Word 2003  autonumbered spaces   The spaces between the numbers in autonumbering and the text lines will always be in Arial font. This is harmless since it maps to Helvetica, which is a base 14 font. Consequently autonumbered Word documents do not cause problems when saved in PDF files without embedded fonts. Just be aware that the PDF will list that it uses a Helvetica font even though that font does not appear anywhere in the source Word document.
Various       Math formula symbols  The summation, product, and integral symbols may not be part of the base 14 fonts. These are all in the Symbol font but may not look as nice as what the original program used. Depending on the program, it may or may not be possible to replace these symbols in a formula with the one from the Symbol font. If not, fonts must be embedded or the resulting PDF will not be portable. Some applications may embed these fonts even when instructed not to.

Example files were printed from MS Word 2003 through PDF Creator 1.2.0, PDF target is /printer, PDF 1.5, with font settings for embed and subset as indicated:

File           Size (bytes)  Type     Embed  Subset  Description or [Size (Font Descriptor Name)]
Example3.doc          19968  MS Word  NA     NA      Source document
Example3.pdf          29194  PDF      Y      Y
                                                       6653 (/HNPULM+Times-Roman /Type1)
                                                       1430 (/QABGUU+Symbol /Type1)
                                                       6626 (/MEDLMS+TimesNewRoman /TrueType)
                                                        532 (/QEGMJO+Helvetica /Type1)
                                                       3536 (/CXOYBI+Symbol /TrueType)
                                                       1528 (/JZJELE+Times-Italic /Type1)
Example3b.pdf          9738  PDF      N      Y
                                                       1430 (/QABGUU+Symbol /Type1)
                                                        168 (/TimesNewRoman /TrueType)
                                                        161 (/Symbol /TrueType)
Example3c.pdf        127448  PDF      Y      N
                                                      34629 (/Times-Roman /Type1)
                                                      17018 (/Symbol /Type1)
                                                       6619 (/TimesNewRoman /TrueType)
                                                      23071 (/Helvetica /Type1)
                                                       3529 (/Symbol /TrueType)
                                                      33734 (/Times-Italic /Type1)
Example3d.pdf         25320  PDF      N      N
                                                      17018 (/Symbol /Type1)
                                                        168 (/TimesNewRoman /TrueType)
                                                        161 (/Symbol /TrueType)

The names of subsetted fonts begin with AAAAAA+, where the A's are replaced by 6 letters. Analysis of these files indicates that in all cases either all (17018 bytes) or a subset (1430 bytes) of the Type 1 Symbol font is embedded, whether or not the embed flag is active. The subset setting is active here even when embed is not set, so the smallest PDF is obtained when embed is off and subset is on.

The PDF XChange Viewer can display the font used for selected text. If Example3.pdf and Example3b.pdf are opened in that program, one can align the pages in each tab and alternate between them, which makes the tiny formatting changes visible. In this way one can see that the integral character is set in a TrueType Symbol font (not the base 14 Symbol font, which is Type 1) in both cases. In the first file it comes from an embedded subset, which tells us that it is the TrueType Symbol subset (/CXOYBI+Symbol) listed for Example3.pdf in the table above.
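
The subset naming convention described above (six letters, a plus sign, then the font name) is easy to test for mechanically. The Python sketch below is an illustration only. The function name is invented for this example, and it assumes font names are written with a leading slash as in the table above.

```python
import re

# Six uppercase letters and "+", then the real font name.
# The leading "/" is optional so bare names also work.
SUBSET_TAG = re.compile(r"^/?([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None when the font is not a subset."""
    match = SUBSET_TAG.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name.lstrip("/")


if __name__ == "__main__":
    for name in ["/QABGUU+Symbol", "/CXOYBI+Symbol", "/TimesNewRoman"]:
        print(name, "->", split_subset_tag(name))
```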

Images - Basics

PDF files frequently contain images in addition to text. PDF files can contain two types of images: vector graphics and bitmaps. The former consists of a series of operations like "draw a line of this length in this position" from which the final image is built up, whereas the latter is an array of values specifying the color of every cell in an array of pixels. Their respective properties are summarized in the table below:

Property                Vector                              Bitmap
Compression             Lossless                            Lossy (usually) / Lossless
Compression Artifacts   No                                  Yes (usually) / No
Encoding                Flate                               DCT
Resolution              Limited by viewer                   Limited by PDF generation parameters
Size                    Proportional to number of elements  Proportional to W x H of image
Supports Transparency?  In theory yes, typically no         Yes
Best For                Line drawings, diagrams             Images

Images - Vector Graphics

Vector graphics are usually the best way to represent diagrams in PDF files. The image is sharp at all magnifications, small in size, and not subject to compression artifacts. The trick is to get these graphics from the source document into the PDF file without triggering an automatic rasterization, and hence conversion to a bitmap. Automatic rasterization happens whenever the gamut of vector graphics operations employed in a program like Word, Illustrator, or LibreOffice Draw is larger than that supported by the PDF standard. Additionally, on Windows many of the PDF generators appear as PostScript printers, and PostScript does not support transparency in vector graphics, whereas PDF does. Consequently a single vector graphic with an "alpha channel" or opacity value other than 1.0 (255), or a transparency other than zero, may convert the entire diagram into a bitmap when printed to a PDF, whereas it may keep its vector nature if an "export PDF" option is used instead.

Example files were printed from MS Word 2003 through PDF Creator 1.2.0, PDF target is /printer, PDF 1.5, with font settings embed=F and subset=T:

File           Size (bytes)  Type                Description
Example4.doc          24576  MS Word             Source for two diagrams, no issues converting to PDF as vector.
Example5.doc          24576  MS Word             Error in diagram 1, transparency not 0 in red circle.
Example6.odt          13033  LibreOffice Writer  Diagram 2 has a 32 step gradient.
Example7.doc          47104  MS Word             Like Example 4 but both diagrams are bitmaps instead of vector graphics.
Example8.svg           5056  Inkscape            Like diagram 1 in Example 5.
Example4.pdf           3657  PDF                 Printed from MS Word 2003, still vector form.
Example4b.pdf          6732  PDF                 Printed from LibreOffice Writer, still vector form.
Example5.pdf           4055  PDF                 Printed from MS Word 2003, transparency achieved by dithering, still vector form (but ugly).
Example5b.pdf         22339  PDF                 Printed from LibreOffice Writer, nonzero transparency rasterizes all of diagram 1.
Example6b.pdf          8115  PDF                 Printed from LibreOffice Writer, still vector, more elements.
Example7.pdf          35930  PDF                 Printed from MS Word 2003, both diagrams were bitmaps in the source.
Example8.pdf           5675  PDF                 Printed from Inkscape, only the red circle is rasterized, the rest of the diagram remains vector.

There is a lot of variation in the rasterization. In some programs only the offending vector element is rasterized and the rest stay as vectors (this may depend on exactly how the diagram is constructed). Examples 5b and 8 show different programs rasterizing what is essentially the same figure in different ways. Using pdfsummary (see software below) the images in the files can be listed, which shows that the rasterization for Example5b.pdf was:

Image (Object): Size (Description) PP: 
   1 (   7):      4756 (W=256 H=256 Bpc=8 DCT) PP: 1
   2 (   8):      4622 (W=256 H=256 Bpc=8 DCT) PP: 1
   3 (   9):      2795 (W=200 H=256 Bpc=8 DCT) PP: 1
   4 (  10):      2174 (W=256 H=100 Bpc=8 DCT) PP: 1
   5 (  11):      2051 (W=256 H=100 Bpc=8 DCT) PP: 1
   6 (  12):      1979 (W=200 H=100 Bpc=8 DCT) PP: 1
whereas for Example8.pdf it was:
Image (Object): Size (Description) PP: 
   1 (   9):      1259 (W=34 H=33 Bpc=8 DCT) PP: 1
So when confronted with a single small transparent object, LibreOffice rasterized the image in chunks (we cannot tell from this information how they are tiled), whereas Inkscape rasterized only the small region around the offending transparent red circle. Notice that there is a poor correlation between the size of the image (as W x H) and the number of bytes. This is because most of the diagram is composed of large blocks of a single color, which compress very well.
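
The weak link between pixel count and byte count can be made explicit by dividing one by the other. The Python sketch below uses only the widths, heights, and sizes from the two pdfsummary listings above. The variable and function names are made up for this example.

```python
# (width, height, compressed bytes) copied from the pdfsummary listings above.
example5b = [
    (256, 256, 4756),
    (256, 256, 4622),
    (200, 256, 2795),
    (256, 100, 2174),
    (256, 100, 2051),
    (200, 100, 1979),
]
example8 = [(34, 33, 1259)]


def bytes_per_pixel(width, height, nbytes):
    """Compressed bytes stored per pixel of the image."""
    return nbytes / (width * height)


if __name__ == "__main__":
    for label, images in (("Example5b.pdf", example5b), ("Example8.pdf", example8)):
        for w, h, n in images:
            print(f"{label}: {w}x{h} {n} bytes, {bytes_per_pixel(w, h, n):.3f} bytes/pixel")
```

The large tiles from Example5b.pdf all come out below 0.1 bytes per pixel, while the single small image in Example8.pdf needs more than 1 byte per pixel.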

Images - Bitmaps

Bitmaps in source documents are characterized by colorspace (color, grayscale, or black and white) and array size (width x height in pixels). When emitted to a PDF file they are in effect pre-rendered onto a display surface whose resolution is specified in dots per inch (dpi). The resulting new bitmap, as it would appear on that surface, is compressed using one of the methods in the table below and stored in the PDF file.

To decrease the size of the image in the PDF file, the resolution (W,H) of the image may be reduced in the source document. This does not mean resizing the image frame, which usually does nothing to the resolution. Rather, some programs provide a method for resampling the image at a lower resolution. If not, an equivalent operation is provided by the PDF generating software, which usually allows the target dpi to be reduced below that set by the "target" default. For instance, for the printer target the default color and grayscale resolution is 300 dpi, but 150 dpi may be adequate.

PDF generating software generally provides different dpi limits and compression settings for each colorspace, but provides no method to set these parameters on an image by image basis. This is unfortunate because 9 out of 10 images may look OK at a high compression setting while the 10th looks terrible. Decreasing the compression to make the 10th image acceptable may greatly increase the size of the PDF file. One workaround is either to reduce the resolution of the 9 images and decrease the compression, or to increase the resolution of the 10th image by oversampling it. In some cases, for instance with PDF Creator, it is possible to print to a PDF in page ranges, so that all print operations eventually end up in the same PDF file. In this manner the compression and dpi settings may be set on a page by page basis.
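
Lowering the image resolution below a target's default can also be done by calling Ghostscript directly, since PDF Creator feeds Ghostscript to produce the PDF. The Python sketch below only builds the command line. It assumes a Ghostscript executable named gs is on the path (on Windows the console executable has a different name), the file names are placeholders, and the 150 dpi value is the example figure from the text rather than a recommendation. The options are placed after the target on the assumption that later settings take precedence over the preset.

```python
import subprocess


def ghostscript_args(src, dst, target="printer", image_dpi=150, gs="gs"):
    """Build a Ghostscript pdfwrite command line; nothing is executed here."""
    return [
        gs,
        "-sDEVICE=pdfwrite",
        "-dNOPAUSE",
        "-dBATCH",
        f"-dPDFSETTINGS=/{target}",        # the PDF target preset
        "-dDownsampleColorImages=true",
        f"-dColorImageResolution={image_dpi}",
        "-dDownsampleGrayImages=true",
        f"-dGrayImageResolution={image_dpi}",
        f"-sOutputFile={dst}",
        src,
    ]


def convert(src, dst, **options):
    """Run Ghostscript; requires that it is actually installed."""
    subprocess.run(ghostscript_args(src, dst, **options), check=True)


if __name__ == "__main__":
    # "lecture.ps" and "lecture.pdf" are placeholder file names.
    print(" ".join(ghostscript_args("lecture.ps", "lecture.pdf")))
```

Building the argument list separately from running it makes it easy to inspect exactly which settings will be applied before any file is produced.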

Bitmap image compression methods are as follows:

Method               Type(s)           Lossy?  Compression Factor             Notes
JPEG                 Color, Grayscale  Y       Parameter                      Resolution falls and artifacts increase as compression rises. Best for images of real world objects, often poor or unacceptable for diagrams and other "sharp" images.
JPEG2000             Color, Grayscale  Y       Parameter                      Available PDF 1.5 and up. Newer version of JPEG, not widely used at this time.
ZIP                  Color, Grayscale  N       Unpredictable and usually low  Works best on poster-like images, with large blocks of the same color.
CCITT Group 4 FAX    BW                N       15:1 or better is common       Default in older PDF generators. Use JBIG2 if it is available.
Run Length Encoding  BW                N       Varies with image              Works well on scanned BW text, which is mostly white.
JBIG2                BW                Y or N  Varies with image              Available PDF 1.4 and up. Usually better compression than CCITT Group 4 but processing takes longer.
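
The note that ZIP works best on poster-like images can be checked with a toy experiment. ZIP compression in PDF is Flate, which Python's zlib module implements. The sketch below compresses two synthetic 256 x 256 grayscale "images": one made of two flat blocks, and one made of pseudo-random bytes standing in for a noisy photograph. Both are invented test data, not real images.

```python
import random
import zlib

WIDTH, HEIGHT = 256, 256
HALF = WIDTH * HEIGHT // 2

# "Poster-like" image: top half one gray level, bottom half another.
poster = bytes([200]) * HALF + bytes([50]) * HALF

# Noisy stand-in for a photograph (seeded so the run is repeatable).
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(WIDTH * HEIGHT))

poster_size = len(zlib.compress(poster, 9))
noisy_size = len(zlib.compress(noisy, 9))

print("uncompressed:", WIDTH * HEIGHT, "bytes")
print("poster-like :", poster_size, "bytes")
print("noisy       :", noisy_size, "bytes")
```

The flat image shrinks to a tiny fraction of its raw size, while the noisy one does not shrink at all, which is why the compression factor for ZIP is so hard to predict.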

Moving PowerPoint notes to PDF comments

Some instructors use the PowerPoint notes feature to associate extra information with their lectures. If the full Acrobat X (and probably some earlier versions) is installed, and the PDF is made by (menu) Adobe PDF -> Convert to Adobe PDF, the presence of the notes will be automatically detected and a dialog will ask if they should be included. If the answer is affirmative the PDF produced will have a comments section.

However, if the presentation is printed to PDF through any of the PostScript-based printer drivers, like PDFCreator or even Acrobat's Adobe PDF listed under printers, the notes are lost unless one employs the Print what: -> Notes pages option in the print dialog. Doing so will cause the notes to use up half the page, reducing the size of the slide. Here is a method to extract the notes from the PPT file and insert them into the PDF file under an icon, so that the information can be seen, or not, as the user desires. The result is similar to what Acrobat produces, but the full Acrobat need not be installed. The intermediate file may be edited (carefully!) like any other text file to modify, add, or remove comments.
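
To give an idea of what the intermediate file contains, here is a Python sketch that writes the same general XFDF structure as the macro below: one text annotation per slide that has notes, each with a rich text body and a popup. It is not the author's method and has not been checked against any particular viewer. The function name is invented, the attribute set is a reduced subset of what the macro emits, and the rect values are simply copied from the macro. Unlike the macro, it escapes characters such as & and < in the notes text.

```python
from xml.sax.saxutils import escape, quoteattr


def notes_to_xfdf(notes_by_slide):
    """Build XFDF text from a dict of 1-based slide number -> notes text."""
    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">',
        "<annots>",
    ]
    for slide, text in sorted(notes_by_slide.items()):
        if not text.strip():
            continue  # slides without notes get no annotation
        page = slide - 1  # PDF pages are numbered from 0
        title = quoteattr(f"Slide{slide} additional notes")
        parts.append(
            f'<text icon="Help" title={title} page="{page}"'
            ' rect="1,-18,21,2" color="#FFFF00">'
        )
        parts.append(
            '<contents-richtext><body xmlns="http://www.w3.org/1999/xhtml">'
            f"<p>{escape(text)}</p></body></contents-richtext>"
        )
        parts.append(f'<popup open="no" page="{page}" rect="0,-97,529,0"/>')
        parts.append("</text>")
    parts += ["</annots>", "</xfdf>"]
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical notes for a three slide presentation; slide 2 has none.
    print(notes_to_xfdf({1: "Mention units & scale", 2: "", 3: "Skip if short on time"}))
```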

In PPT
1. tools -> macro -> security
   set to medium
2. alt f11 to start VBA editor
3. make sure project is highlighted in left hand pane
   insert -> module
4. Paste in the following, which is a script that converts PPT notes to an .xfdf file. Be careful about line wrapping. The "<body" line, for instance, is very long, and it has to go in as one line in the module.


Sub ExportNotesXFDF()
'
' David Mathog, Caltech 3/2/2011. version 0.0.1
' Export PPT notes to .xfdf format so that they may
' be applied as comments to a PDF file.  This is a modification of the text export
' module here:  http://www.pptfaq.com/FAQ00481.htm
'
' The timezone for the generator is hardcoded, look for "strPDFtz".  There should be
' a way to set it in the script, not sure how though.
'
'

    Dim oSlides As Slides
    Dim oSl As Slide
    Dim oSh As Shape
    Dim strFileName As String
    Dim intFileNum As Integer
    Dim lngReturn As Long
    Dim intNewSlide As Integer
    Dim strPDFcomments As String  ' Variable for accumulating PPT notes, per slide
    Dim intPDFpage As Integer     ' Page numbers inside PDF, 0 to N-1
    Dim strPDFqpos As String      ' location for the "?" icon
    Dim strPDFtpos As String      ' location for the text pop up
    Dim strPDFdate As String
    Dim strPDFqstyle As String    ' style for the "?"
    Dim strPDFtstyle As String    ' style for the comment text
    Dim strPDFtz As String        ' Time zone where this was generated
    Dim strPDFtitle As String     ' like "Slide 1 additional notes"
    Dim strPDFpagename As String  ' like "slide2"
    Dim strPDFtext As String      ' all comment text for this slide
    Dim strPDFxeol As String      ' line delimiter within a comment
   
   
    strPDFqpos = "1,-18,21,2"       ' rectangle position for ?
    strPDFtpos = "0,-97,529,0"    ' rectangle position for text box
    strPDFxeol = "&#xD;"          ' line delimiter; assumed to be the XML carriage-return entity
    strPDFtz = "-08'00'"          ' PST
    strPDFdate = "D:" _
      & Format$(Now, "yyyy") _
      & Format$(Now, "mm") _
      & Format$(Now, "dd") _
      & Format$(Now, "hh") _
      & Format$(Now, "nn") _
      & Format$(Now, "ss") _
      & strPDFtz
    intNewSlide = 1
    strPDFqstyle = "font-size:12.0pt;text-align:left;" _
      & "color:#000000;font-weight:normal;font-style:" _
      & "normal;font-family:Arial;font-stretch:normal"
    strPDFtstyle = "font-family:Arial;font-size:12.0pt"
   
    ' Get a filename to store the collected text
    strFileName = InputBox("Enter the full path and name of file to hold extracted notes (as xfdf)", "Output file?")

    ' did user cancel?
    If strFileName = "" Then
        Exit Sub
    End If

    ' is the path valid?  crude but effective test:  try to create the file.
    intFileNum = FreeFile()
    On Error Resume Next
    Open strFileName For Output As intFileNum
    If Err.Number <> 0 Then     ' we have a problem
        MsgBox "Couldn't create the file: " & strFileName & vbCrLf _
            & "Please try again."
        Exit Sub
    End If
    Close #intFileNum  ' temporarily

    ' Get the notes text
    Set oSlides = ActivePresentation.Slides
    For Each oSl In oSlides
        For Each oSh In oSl.NotesPage.Shapes
        If oSh.PlaceholderFormat.Type = ppPlaceholderBody Then
            If oSh.HasTextFrame Then
                If oSh.TextFrame.HasText Then
                  If intNewSlide > 0 Then
                    intNewSlide = 0
                    intPDFpage = oSl.SlideIndex - 1
                    strPDFpagename = "Slide" & CStr(oSl.SlideIndex)
                    strPDFcomments = strPDFcomments _
                      & "<text icon=""Help""" _
                      & " title=""" & strPDFpagename & " additional notes""" _
                      & " creationdate=""" & strPDFdate & """" _
                      & " subject=""Sticky Note""" _
                      & " page=""" & CStr(intPDFpage) & """" _
                      & " flags=""print,nozoom,norotate""" _
                      & " Name=""" & strPDFpagename & """" _
                      & " rect=""" & strPDFqpos & """" _
                      & " color=""#FFFF00"">" & vbCrLf
                    strPDFcomments = strPDFcomments _
                      & "<contents-richtext>" & vbCrLf _
                      & "<body xmlns=""http://www.w3.org/1999/xhtml"" xmlns:xfa=""http://www.xfa.org/schema/xfa-data/1.0/"" xfa:APIVersion=""Acrobat:7.0.8"" xfa:spec=""2.0.2""" _
                      & " style=""" & strPDFqstyle & """>" & vbCrLf
                  End If
                  strPDFcomments = strPDFcomments _
                    & "<p><span style=""" & strPDFtstyle & """>" _
                    & oSh.TextFrame.TextRange.Text & strPDFxeol
                End If
            End If
        End If
        Next oSh
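        ' close out the xml for this slide, but only if this slide had notes
        ' (intNewSlide is set to 0 when a comment is opened above); otherwise
        ' unmatched closing tags would make the .xfdf file malformed
        If intNewSlide = 0 Then
            strPDFcomments = strPDFcomments _
              & "</span></p></body></contents-richtext>" & vbCrLf _
              & "<popup open=""no""" _
              & " page=""" & CStr(intPDFpage) & """" _
              & " Date=""" & strPDFdate & """" _
              & " flags=""invisible,nozoom,norotate""" _
              & " Name=""" & strPDFpagename & """" _
              & " rect=""" & strPDFtpos & """/>" & vbCrLf _
              & " </text>" & vbCrLf
        End If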
       
        intNewSlide = 1
    Next oSl

    ' now write the text to file
    Open strFileName For Output As intFileNum
   
    ' Print the header
    Print #intFileNum, "<?xml version=""1.0"" encoding=""UTF-8""?>"
    Print #intFileNum, "<xfdf xmlns=""http://ns.adobe.com/xfdf/"" xml:space=""preserve"">"
    Print #intFileNum, "<annots>"

    Print #intFileNum, strPDFcomments
   
    ' Print the footer
    Print #intFileNum, "</annots>"
    Print #intFileNum, "</xfdf>"

    Close #intFileNum

    ' show what we've done (the path is quoted so that names with spaces work)
    lngReturn = Shell("NOTEPAD.EXE """ & strFileName & """", vbNormalFocus)

End Sub

5. run -> run sub
   name the file and save it

In PDF XChange Viewer:
1. comments -> import comments
2. select xfdf format in the file selector
3. select the file you just created

Bingo, the slides should now be commented.

If new PowerPoint presentations are made by copying this one and deleting the slides, the module will tag along and won't need to be created again.



Free PDF software

Free PDF software used in preparing these notes is described in the table below.

Program              Description
PDF XChange Viewer   PDF viewer for Windows, excellent alternative to Acrobat
PDF Creator          PDF generator for Windows, excellent alternative to Distiller. Sets up a PostScript printer which feeds Ghostscript to produce PDF
PDF Summary (Linux)  Scripts to analyze the content of PDF files - helps answer the question: what is taking up space? Requires Python and Perl.

last updated 21-Oct-2011