This is a compilation of information concerning PDF methods for course web sites. This information was gathered while supporting the BMB170 and BMB178 courses. Please send corrections or comments to: mathog A T caltech D o T edu.
When distributing PDF files electronically, all else being equal, it is preferable that they be as small as possible. While it is not possible in all cases to achieve size reduction, with a little care PDF files which were once many megabytes may be reduced to tens of kilobytes - with no loss of content. In fact, the graphics in the smaller files often look better than in the larger ones. The methods described here are specifically for this scenario and are not appropriate for PDFs intended for prepress or other applications where formatting is to be controlled to the pixel.
Most PDF generators have target settings: default, screen, ebook, printer, prepress. Each of these sets default values for many of the PDF generation parameters. In general, if nothing else is changed, setting the target to /screen will result in the smallest PDF file, and setting it to /prepress in the largest. (Unless the factors discussed in this document are taken into account, the resulting files are highly unlikely to be of equal display quality!) Unfortunately, in most PDF creation software the side effects of changing the target are not evident in the user interface. For instance, if the target is set to /screen in PDF Creator, the color image resolution is restricted to a maximum of 72 dpi, but the printer driver interface does not reflect this. This can be very confusing: one may set the resolution to, say, 300 dpi, print a PDF, reduce the setting to 150 dpi, print again, and find that both files are the same size. The following image is provided by primopdf.com (the original file, with clickable links, is here) and summarizes these interactions:
When a document is composed in an application like Microsoft Word or Powerpoint one is (almost) completely free to choose fonts. When the document is converted to a PDF file those fonts, or a subset comprising just the characters which were used, are embedded in that file. The benefit of this is that it allows the PDF file to be viewed on a remote system exactly as it appeared on the originator's, even when the recipient's system has no copy of these fonts installed. The cost of this is size: embedding fonts makes the PDF file larger, and depending on which fonts are used the file may be much larger. There are usually two parameters that control this process, called something like embed and subset. Their action is moderately complex, since some fonts will be subsetted even if subset is not set, other fonts may be embedded even if embed is not set, and the values indicated in the user interface may be silently overridden by the PDF target setting.
In any case, not all fonts must be embedded. The PDF specification requires that all compatible applications support the standard Type 1, or "base 14", fonts natively. These fonts are:
Times (regular, italic, bold, and bold italic)
Courier (regular, oblique, bold and bold oblique)
Helvetica (regular, oblique, bold and bold oblique)
Symbol
Zapf Dingbats
Moreover, some of these fonts will be used when similar fonts were specified. Common substitutions are:
Times for Times New Roman
Helvetica for Arial
Courier for Courier New
Example files were printed from MS Word 2003 through PDF Creator 1.2.0, PDF target is /printer, PDF 1.5, with font settings for embed and subset as indicated:
File | Size | Type | Embed | Subset | Description |
Example1.doc | 19968 | MS Word | NA | NA | text in 3 non base 14 fonts |
Example1.pdf | 29121 | | Y | Y | /printer setting, 3 non base 14 fonts |
Example1b.pdf | 5783 | | N | Y | /printer setting, 3 non base 14 fonts |
Example2.doc | 19968 | MS Word | NA | NA | text in 3 base 14 fonts |
Example2.pdf | 16315 | | Y | Y | 3 base 14 fonts |
Example2b.pdf | 4035 | | N | Y | 3 base 14 fonts |
Example2c.pdf | 184058 | | Y | N | 3 base 14 fonts |
Example2d.pdf | 4035 | | N | N | 3 base 14 fonts |
These examples illustrate that base 14 fonts are roughly as large as non base 14 fonts when embedded, that embedding a subset saves a lot of space over embedding the entire font, and that embedding no fonts produces the smallest PDFs. However, the key point is that when non-base 14 fonts are not embedded the resulting PDF is not portable. So Example1b.pdf will not necessarily display correctly everywhere, but Example2b.pdf will. Note also that there is a 46X difference in size between the smallest and largest PDF files which will display correctly - with no difference whatsoever in the image which it contains.
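The ratios quoted above can be rechecked from the byte counts in the table. The following is only a small Python sketch of that arithmetic; the dictionary restates the table, and the helper name `ratio` is illustrative rather than part of any tool mentioned here.

```python
# File sizes in bytes, copied from the example table above.
sizes = {
    "Example1.pdf": 29121,    # embed=Y subset=Y, non base 14 fonts
    "Example1b.pdf": 5783,    # embed=N subset=Y, non base 14 fonts (not portable)
    "Example2.pdf": 16315,    # embed=Y subset=Y, base 14 fonts
    "Example2b.pdf": 4035,    # embed=N subset=Y, base 14 fonts
    "Example2c.pdf": 184058,  # embed=Y subset=N, base 14 fonts
    "Example2d.pdf": 4035,    # embed=N subset=N, base 14 fonts
}


def ratio(larger, smaller):
    """Return how many times bigger `larger` is than `smaller`."""
    return sizes[larger] / sizes[smaller]


# Example1b.pdf is left out because it is not guaranteed to display correctly.
portable = [name for name in sizes if name != "Example1b.pdf"]
largest = max(portable, key=sizes.get)
smallest = min(portable, key=sizes.get)

print(largest, smallest, round(ratio(largest, smallest)))
print("full font vs. subset:", round(ratio("Example2c.pdf", "Example2.pdf"), 1))
```

The first line printed compares the largest and smallest portable files; the second compares embedding entire fonts with embedding subsets of the same fonts.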
If only base 14 fonts are to be used, and none of these are to be embedded, one must be exceedingly careful that the application in use does not use an unintended font. If it does, the resulting PDF file will not be portable. Here are some of the known instances where fonts are "out of control":
Application | Symbol(s) | Description |
MS Word 2003 | ≤, ≥ | Be sure to use Symbol font, characters 163 or 179. Any other font, even a base 14 font, will use Unicode characters 2264 or 2265, which are not in the base 14 font set. |
MS Word 2003 | autonumbered spaces | The spaces between the numbers in autonumbering and the text lines will always be in Arial font. This is harmless since it maps to Helvetica, which is a base 14 font. Consequently autonumbered Word documents do not cause problems when saved in PDF files without embedded fonts. Just be aware that the PDF will list that it uses a Helvetica font even though that font does not appear anywhere in the source Word document. |
Various | Math Formula Symbols | The summation, product, and integral symbols may not be part of the base 14 fonts. These are all in the Symbol font but may not look as nice as what the original program used. Depending on the program, it may or may not be possible to replace these symbols in a formula with the one from the Symbol font. If not, fonts must be embedded or the resulting PDF will not be portable. Some applications may embed these fonts even when instructed not to. |
Example files were printed from MS Word 2003 through PDF Creator 1.2.0, PDF target is /printer, PDF 1.5, with font settings for embed and subset as indicated:
File | Size | Type | Embed | Subset | Description or [Size (Font Descriptor Name)] |
Example3.doc | 19968 | MS Word | NA | NA | Source document |
Example3.pdf | 29194 | | Y | Y | 6653 (/HNPULM+Times-Roman /Type1), 1430 (/QABGUU+Symbol /Type1), 6626 (/MEDLMS+TimesNewRoman /TrueType), 532 (/QEGMJO+Helvetica /Type1), 3536 (/CXOYBI+Symbol /TrueType), 1528 (/JZJELE+Times-Italic /Type1) |
Example3b.pdf | 9738 | | N | Y | 1430 (/QABGUU+Symbol /Type1), 168 (/TimesNewRoman /TrueType), 161 (/Symbol /TrueType) |
Example3c.pdf | 127448 | | Y | N | 34629 (/Times-Roman /Type1), 17018 (/Symbol /Type1), 6619 (/TimesNewRoman /TrueType), 23071 (/Helvetica /Type1), 3529 (/Symbol /TrueType), 33734 (/Times-Italic /Type1) |
Example3d.pdf | 25320 | | N | N | 17018 (/Symbol /Type1), 168 (/TimesNewRoman /TrueType), 161 (/Symbol /TrueType) |
The names of subsetted fonts begin with AAAAAA+, where the A's are replaced by 6 letters. Analysis of these files indicates that in all cases either all (17018) or a subset (1430) of a Symbol font is embedded, whether or not the embed flag is active. The subset setting is active here even in the absence of embed being set, so the smallest PDF is obtained when embed is off and subset is on. The PDF XChange Viewer can display the font used for selected text. If Example3.pdf and Example3b.pdf are opened in that program one can align the pages in each tab and alternate between them, which makes the tiny formatting changes visible. In this way one can see that the integral character is in a TrueType Symbol font (not the base 14 Symbol font, which is a Type 1 font) in both cases. In the first file it comes from an embedded subset, which tells us that it is coming from the TrueType Symbol subset (/CXOYBI+Symbol) listed for Example3.pdf in the table above (indicated in purple in the original table).
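The subset naming rule described above (six letters followed by "+") is easy to test for mechanically. The sketch below assumes the six letters are uppercase, as they are in every name in the table; the function name `split_subset_tag` is made up for this illustration.

```python
import re

# A subset tag is six uppercase letters followed by "+", for example
# "/HNPULM+Times-Roman". The leading "/" of the PDF name is optional here.
SUBSET_TAG = re.compile(r"^/?([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None when the name has no subset tag."""
    match = SUBSET_TAG.match(font_name)
    if match:
        return match.group(1), match.group(2)
    return None, font_name.lstrip("/")


# Font descriptor names taken from the Example3.pdf and Example3b.pdf rows.
for name in ["/HNPULM+Times-Roman", "/CXOYBI+Symbol", "/TimesNewRoman", "/Symbol"]:
    tag, base = split_subset_tag(name)
    print(f"{name:22} subset={tag is not None} base={base}")
```

Run over the font list of a PDF, a check like this shows at a glance which fonts were subsetted and which were referenced or embedded whole.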
PDF files frequently contain images in addition to text. PDF files can contain two types of images: vector graphics and bit maps. The former consists of a series of operations like "draw a line of this length in this position" from which the final image is built up, whereas the latter is an array of values specifying the color of every cell in an array of pixels. Their respective properties are summarized in the table below:
Property | Vector | Bitmap |
Compression | Lossless | Lossy (Usually) / Lossless |
Compression Artifacts | No | Y (Usually) / N |
Encoding | Flate | DCT |
Resolution | Limited by Viewer | Limited by PDF generation parameters |
Size | Proportional to number of elements | Proportional to W x H of image |
Supports Transparency? | In theory yes, Typically no | Yes |
Best For | Line Drawings, Diagrams | Images |
Vector graphics are usually the best way to represent diagrams in PDF files. The image is sharp at all magnifications, small in size, and not subject to compression artifacts. The trick is to get these graphics from the source document into the PDF file without triggering an automatic rasterization and hence conversion to a bitmap. The automatic rasterization happens whenever the gamut of vector graphics operations employed in a program like Word, Illustrator, or LibreOffice Draw is larger than that supported by the PDF standard. Additionally, on Windows many of the PDF generators appear as Postscript printers, and Postscript does not support transparency in vector graphics, whereas PDF does. Consequently a single vector graphic with an "alpha channel" or opacity value other than 1.0 (255), or a transparency different than zero, may convert the entire diagram into a bitmap when printed to a PDF, whereas it may successfully maintain its vector nature if an "export PDF" option is used instead.
Example files were printed from MS Word 2003 through PDF Creator 1.2.0, PDF target is /printer, PDF 1.5, with font settings embed=F and subset=T:
File | Size | Type | Description |
Example4.doc | 24576 | MS Word | Source for two diagrams, no issues converting to PDF as vector. |
Example5.doc | 24576 | MS Word | Error in diagram 1, transparency not 0 in red circle. |
Example6.odt | 13033 | LibreOffice Writer | Diagram 2 has a 32 step gradient. |
Example7.doc | 47104 | MS Word | Like Example 4 but both diagrams are bitmaps instead of vector graphics. |
Example8.svg | 5056 | Inkscape | Like diagram 1 in Example 5. |
Example4.pdf | 3657 | | Printed from MS Word 2003, still vector form. |
Example4b.pdf | 6732 | | Printed from LibreOffice Writer, still vector form. |
Example5.pdf | 4055 | | Printed from MS Word 2003, transparency achieved by dithering, still vector form (but ugly). |
Example5b.pdf | 22339 | | Printed from LibreOffice Writer, nonzero transparency rasterizes all of diagram 1. |
Example6b.pdf | 8115 | | Printed from LibreOffice Writer, still vector, more elements. |
Example7.pdf | 35930 | | Printed from MS Word 2003, both diagrams were bitmaps in the source. |
Example8.pdf | 5675 | | Printed from Inkscape, only the red circle is rasterized, the rest of the diagram remains vector. |
There is a lot of variation in the rasterization: in some programs only the offending vector element is rasterized and the rest stay as vectors (this may depend on exactly how the diagram is constructed). Examples 5b and 8 show different programs rasterizing what is essentially the same figure in different ways. Using pdfsummary (see software below) the images in the files can be listed, which shows that the rasterization for Example5b.pdf was:
Image (Object): Size (Description) PP:
1 ( 7): 4756 (W=256 H=256 Bpc=8 DCT) PP: 1
2 ( 8): 4622 (W=256 H=256 Bpc=8 DCT) PP: 1
3 ( 9): 2795 (W=200 H=256 Bpc=8 DCT) PP: 1
4 ( 10): 2174 (W=256 H=100 Bpc=8 DCT) PP: 1
5 ( 11): 2051 (W=256 H=100 Bpc=8 DCT) PP: 1
6 ( 12): 1979 (W=200 H=100 Bpc=8 DCT) PP: 1
whereas for Example8.pdf it was:
Image (Object): Size (Description) PP:
1 ( 9): 1259 (W=34 H=33 Bpc=8 DCT) PP: 1
So when confronted with a single small transparent object, LibreOffice rasterized the image in chunks (we cannot tell from this information how they are tiled), whereas Inkscape only rasterized the small region around the offending transparent red circle. Notice that there is a poor correlation between the size of the image (as W x H) and the number of bytes; this is because most of the diagram is composed of large blocks of a single color, which compresses very well.
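The poor correlation between pixel count and byte count can be made concrete by dividing one by the other. The Python sketch below parses the listing lines quoted above; the regular expression is inferred only from those few lines and is an assumption about the pdfsummary output format, not a description of it.

```python
import re

# Image lines copied from the Example5b.pdf and Example8.pdf listings above.
example5b = """\
1 ( 7): 4756 (W=256 H=256 Bpc=8 DCT) PP: 1
2 ( 8): 4622 (W=256 H=256 Bpc=8 DCT) PP: 1
3 ( 9): 2795 (W=200 H=256 Bpc=8 DCT) PP: 1
4 ( 10): 2174 (W=256 H=100 Bpc=8 DCT) PP: 1
5 ( 11): 2051 (W=256 H=100 Bpc=8 DCT) PP: 1
6 ( 12): 1979 (W=200 H=100 Bpc=8 DCT) PP: 1
"""
example8 = "1 ( 9): 1259 (W=34 H=33 Bpc=8 DCT) PP: 1\n"

# Assumed layout: index ( object): bytes (W=.. H=.. Bpc=.. filter) PP: page
LINE = re.compile(
    r"^\s*\d+\s*\(\s*(\d+)\):\s*(\d+)\s*\(W=(\d+) H=(\d+) Bpc=(\d+) (\w+)\)"
)


def parse(listing):
    """Yield (object_number, size_bytes, width, height) for each image line."""
    for line in listing.splitlines():
        match = LINE.match(line)
        if match:
            yield tuple(int(match.group(i)) for i in range(1, 5))


for label, listing in [("Example5b.pdf", example5b), ("Example8.pdf", example8)]:
    images = list(parse(listing))
    total_bytes = sum(size for _, size, _, _ in images)
    total_pixels = sum(w * h for _, _, w, h in images)
    print(label, len(images), "images", total_bytes, "bytes",
          total_pixels, "pixels", round(total_bytes / total_pixels, 3), "bytes/pixel")
```

The bytes per pixel figure is far lower for the large, mostly uniform tiles than for the small image around the circle, which is the effect described above.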
Bitmaps in source documents are characterized by colorspace (color, grayscale, or black and white) and array size (Width x Height in pixels). When emitted to a PDF file they are in effect pre-rendered onto a display surface whose resolution is specified in dots per inch (dpi), and the resulting new bitmap, as it would appear on the display surface, is compressed using one of the methods in the table below and stored in the PDF file. To decrease the size of the image in the PDF file the resolution (W,H) of the image may be reduced in the source document. This does not mean resizing the image frame, which usually does nothing to the resolution. Rather, some programs provide a method for resampling the image at a lower resolution. If not, an equivalent operation is provided by the PDF generating software, which usually allows the target dpi to be reduced below that set by the "target" default. For example, for the printer target the default color and grayscale resolution is 300 dpi, but 150 dpi may be adequate.
PDF generating software generally provides different dpi limits and compression settings for each colorspace, but provides no method to set these parameters on an image by image basis. This is unfortunate because 9 out of 10 images may look OK at a high compression setting, but the 10th may look terrible. Decreasing the compression to make the 10th image acceptable may greatly increase the size of the PDF file. One workaround is either to reduce the resolution of the 9 images and decrease the compression, or to increase the resolution of the 10th image by oversampling it. With some software, PDF Creator for instance, it is possible to print to a PDF in page ranges, so that all print operations eventually end up in the same PDF file. In this manner the compression and dpi settings may be set on a page by page basis.
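To get a feel for what lowering the target dpi does before any compression is applied, the pixel arithmetic can be sketched in Python. The 4 x 3 inch frame and both function names are hypothetical, chosen only for illustration; real file sizes also depend on the compression method and on the image content.

```python
def rendered_pixels(width_in, height_in, dpi):
    """Pixel dimensions of an image frame of the given printed size at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_px, height_px, channels=3, bits_per_channel=8):
    """Raw size of the bitmap before any compression (default: 24 bit color)."""
    return width_px * height_px * channels * bits_per_channel // 8


# Hypothetical 4 x 3 inch color image frame rendered at three target resolutions.
for dpi in (300, 150, 72):
    w, h = rendered_pixels(4, 3, dpi)
    print(f"{dpi:3d} dpi: {w} x {h} pixels, {uncompressed_bytes(w, h)} bytes raw")
```

Halving the dpi quarters the pixel count, which is why the step from 300 dpi to 150 dpi is usually the single most effective size reduction for image-heavy files.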
Bitmap image compression methods are as follows:
Method | Type(s) | Lossy? | Compression Factor | Notes |
JPEG | Color,Greyscale | Y | Parameter | Resolution falls and artifacts increase as compression rises. Best for images of real world objects, often poor or unacceptable for diagrams and other "sharp" images. |
JPEG2000 | Color,Greyscale | Y | Parameter | Available PDF 1.5 and up. Newer version of JPEG, not widely used at this time. |
ZIP | Color,Greyscale | N | unpredictable and usually low | Works best on poster like images, with large blocks of the same color. |
CCITT Group 4 FAX | BW | N | 15:1 or better is common | Default in older PDF generators. Use JBIG2 if it is available. |
Run Length Encoding | BW | N | varies with image | Works well on scanned BW text, which is mostly white. |
JBIG2 | BW | Y or N | varies with image | Available PDF 1.4 and up. Usually better compression than CCITT Group 4 but processing takes longer. |
Some instructors use the Powerpoint notes feature to associate extra information with their lectures. If the full Acrobat X (and probably some earlier versions) is installed, and the PDF is made by (menu) Adobe PDF -> Convert to Adobe PDF, the presence of the comments will be automatically detected and a dialog will ask if they should be included. If the answer is affirmative the PDF produced will have a comments section.
However, if the presentation is printed to PDF through any of the postscript based printer drivers, like PDFCreator or even Acrobat's Adobe PDF listed under printers, the comment information is lost unless one employs the Print what: -> Notes pages option in the print dialog. Doing so will cause the notes to use up half the page, reducing the size of the slide. Here is a method to extract the notes information from the PPT file and insert it into the PDF file under an icon, so that the information can be seen, or not, as the user desires. The result is similar to what Acrobat produces, but the full Acrobat need not be installed. The intermediate file may be edited (carefully!) like any other text file to modify, add, or remove comments.
In PPT:
1. Tools -> Macro -> Security: set to medium.
2. Alt-F11 to start the VBA editor.
3. Make sure the project is highlighted in the left hand pane, then Insert -> Module.
4. Paste in the following, which is a script that converts PPT notes to an .xfdf file. Be careful about line wrapping. The "<body" line, for instance, is very long, and it has to go in as one line in the module.

    Sub ExportNotesXFDF()
    '
    ' David Mathog, Caltech 3/2/2011. version 0.0.1
    ' Export PPT notes to .xfdf format so that they may
    ' be applied as comments to a PDF file. This is a modification of the text export
    ' module here: http://www.pptfaq.com/FAQ00481.htm
    '
    ' The timezone for the generator is hardcoded, look for "strPDFtz". There should be
    ' a way to set it in the script, not sure how though.
    '
        Dim oSlides As Slides
        Dim oSl As Slide
        Dim oSh As Shape
        Dim strFileName As String
        Dim intFileNum As Integer
        Dim lngReturn As Long
        Dim intNewSlide As Integer
        Dim strPDFcomments As String ' Variable for accumulating PPT notes, per slide
        Dim intPDFpage As Integer    ' Page numbers inside PDF, 0 to N-1
        Dim strPDFqpos As String     ' location for the "?" icon
        Dim strPDFtpos As String     ' location for the text pop up
        Dim strPDFdate As String
        Dim strPDFqstyle As String   ' style for the "?"
        Dim strPDFtstyle As String   ' style for the comment text
        Dim strPDFtz As String       ' Time zone where this was generated
        Dim strPDFtitle As String    ' like "Slide 1 additional notes"
        Dim strPDFpagename As String ' like "slide2"
        Dim strPDFtext As String     ' all comment text for this slide
        Dim strPDFxeol As String     ' line delimiter within a comment

        strPDFqpos = "1,-18,21,2"    ' rectangle position for ?
        strPDFtpos = "0,-97,529,0"   ' rectangle position for text box
        strPDFxeol = " "
        strPDFtz = "-08'00'"         ' PST
        strPDFdate = "D:" _
            & Format$(Now, "yyyy") _
            & Format$(Now, "mm") _
            & Format$(Now, "dd") _
            & Format$(Now, "hh") _
            & Format$(Now, "nn") _
            & Format$(Now, "ss") _
            & strPDFtz
        intNewSlide = 1
        strPDFqstyle = "font-size:12.0pt;text-align:left;" _
            & "color:#000000;font-weight:normal;font-style:" _
            & "normal;font-family:Arial;font-stretch:normal"
        strPDFtstyle = "font-family:Arial;font-size:12.0pt"

        ' Get a filename to store the collected text
        strFileName = InputBox("Enter the full path and name of file to hold extracted notes (as xfdf)", "Output file?")

        ' did user cancel?
        If strFileName = "" Then
            Exit Sub
        End If

        ' is the path valid? crude but effective test: try to create the file.
        intFileNum = FreeFile()
        On Error Resume Next
        Open strFileName For Output As intFileNum
        If Err.Number <> 0 Then
            ' we have a problem
            MsgBox "Couldn't create the file: " & strFileName & vbCrLf _
                & "Please try again."
            Exit Sub
        End If
        Close #intFileNum ' temporarily

        ' Get the notes text
        Set oSlides = ActivePresentation.Slides
        For Each oSl In oSlides
            For Each oSh In oSl.NotesPage.Shapes
                If oSh.PlaceholderFormat.Type = ppPlaceholderBody Then
                    If oSh.HasTextFrame Then
                        If oSh.TextFrame.HasText Then
                            If intNewSlide > 0 Then
                                ' first note text on this slide: open the annotation
                                intNewSlide = 0
                                intPDFpage = oSl.SlideIndex - 1
                                strPDFpagename = "Slide" & CStr(oSl.SlideIndex)
                                strPDFcomments = strPDFcomments _
                                    & "<text icon=""Help""" _
                                    & " title=""" & strPDFpagename & " additional notes""" _
                                    & " creationdate=""" & strPDFdate & """" _
                                    & " subject=""Sticky Note""" _
                                    & " page=""" & CStr(intPDFpage) & """" _
                                    & " flags=""print,nozoom,norotate""" _
                                    & " Name=""" & strPDFpagename & """" _
                                    & " rect=""" & strPDFqpos & """" _
                                    & " color=""#FFFF00"">" & vbCrLf
                                strPDFcomments = strPDFcomments _
                                    & "<contents-richtext>" & vbCrLf _
                                    & "<body xmlns=""http://www.w3.org/1999/xhtml"" xmlns:xfa=""http://www.xfa.org/schema/xfa-data/1.0/"" xfa:APIVersion=""Acrobat:7.0.8"" xfa:spec=""2.0.2""" _
                                    & " style=""" & strPDFqstyle & """>" & vbCrLf
                            End If
                            strPDFcomments = strPDFcomments _
                                & "<p><span style=""" & strPDFtstyle & """>" _
                                & oSh.TextFrame.TextRange.Text & strPDFxeol
                        End If
                    End If
                End If
            Next oSh
            ' close out the xml for this slide, but only if an annotation was
            ' opened for it; slides without notes must not emit closing tags
            If intNewSlide = 0 Then
                strPDFcomments = strPDFcomments _
                    & "</span></p></body></contents-richtext>" & vbCrLf _
                    & "<popup open=""no""" _
                    & " page=""" & CStr(intPDFpage) & """" _
                    & " Date=""" & strPDFdate & """" _
                    & " flags=""invisible,nozoom,norotate""" _
                    & " Name=""" & strPDFpagename & """" _
                    & " rect=""" & strPDFtpos & """/>" & vbCrLf _
                    & " </text>"
            End If
            intNewSlide = 1
        Next oSl

        ' now write the text to file
        Open strFileName For Output As intFileNum
        ' Print the header
        Print #intFileNum, "<?xml version=""1.0"" encoding=""UTF-8""?>"
        Print #intFileNum, "<xfdf xmlns=""http://ns.adobe.com/xfdf/"" xml:space=""preserve"">"
        Print #intFileNum, "<annots>"
        Print #intFileNum, strPDFcomments
        ' Print the footer
        Print #intFileNum, "</annots>"
        Print #intFileNum, "</xfdf>"
        Close #intFileNum

        ' show what we've done
        lngReturn = Shell("NOTEPAD.EXE " & strFileName, vbNormalFocus)
    End Sub

5. Run -> Run Sub; name the file and save it.

In PDF XChange Viewer:
1. Comments -> Import Comments.
2. Select xfdf format in the file selector.
3. Select the file you just created.

Bingo, the slides should now be commented. If new Powerpoint presentations are made by copying this one and deleting the slides, the module will tag along and won't need to be created again.
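Because the intermediate .xfdf file may be hand edited, and because the macro copies the note text into it verbatim (so a note containing "&" or "<" would produce invalid XML), it can be worth checking that the file is still well-formed before importing it. The following Python sketch does that with the standard library; `check_xfdf` is a hypothetical helper name, and the sample is a cut-down stand-in for the macro's output, not a real export.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "{http://ns.adobe.com/xfdf/}"


def check_xfdf(xml_text):
    """Parse XFDF text and return the page numbers of its <text> annotations.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed XML.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [int(note.get("page")) for note in root.iter(XFDF_NS + "text")]


# Cut-down, hypothetical sample with one sticky note on the first page.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
<annots>
<text icon="Help" title="Slide1 additional notes" page="0" rect="1,-18,21,2" color="#FFFF00">
<contents-richtext><body xmlns="http://www.w3.org/1999/xhtml"><p><span>R&amp;D notes</span></p></body></contents-richtext>
<popup open="no" page="0" rect="0,-97,529,0"/>
</text>
</annots>
</xfdf>"""

print("annotations on pages:", check_xfdf(sample))

# The same note with a raw ampersand, as a verbatim copy of the note would be.
try:
    check_xfdf(sample.replace("&amp;", "&"))
except ET.ParseError as err:
    print("not well-formed:", err)
```

To check a real export, read the file the macro wrote and pass its contents to `check_xfdf`; a parse error points at the line and column that need escaping or repair.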
Free PDF software used in preparing these notes is described in the table below.
Program | Description |
PDF XChange Viewer | PDF viewer for Windows, excellent alternative to Acrobat |
PDF Creator | PDF generator for Windows, excellent alternative to Distiller. Sets up a Postscript printer which feeds Ghostscript to produce PDF |
PDF Summary | (Linux) Scripts to analyze the content of PDF files - helps answer the question: what is taking up space? Requires Python and Perl. |
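Since PDF Creator hands the print job to Ghostscript, the target, font, and resolution settings discussed earlier can also be tried directly from a command line. The Python sketch below only assembles such a command and does not run it. The function name, the file names, and the executable name `gs` are assumptions (Windows builds use a different executable name), and the source does not state which options PDF Creator itself passes. The switches shown are standard Ghostscript pdfwrite parameters.

```python
import shlex


def _flag(value):
    """Render a Python bool as a Ghostscript boolean."""
    return "true" if value else "false"


def gs_pdfwrite_args(src, dst, target="/screen", embed_all_fonts=False,
                     subset_fonts=True, color_dpi=None):
    """Build a Ghostscript pdfwrite command line as a list (not executed here)."""
    args = [
        "gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.5",
        f"-dPDFSETTINGS={target}",          # preset first, overrides after it
        f"-dEmbedAllFonts={_flag(embed_all_fonts)}",
        f"-dSubsetFonts={_flag(subset_fonts)}",
    ]
    if color_dpi is not None:
        args += ["-dDownsampleColorImages=true",
                 f"-dColorImageResolution={color_dpi}"]
    args += [f"-sOutputFile={dst}", src]
    return args


# Hypothetical input and output names; /printer target with color images at 150 dpi.
print(shlex.join(gs_pdfwrite_args("slides.ps", "slides.pdf",
                                  target="/printer", color_dpi=150)))
```

The explicit font and resolution switches are placed after the preset so that they take precedence over its defaults; it is worth confirming the result by comparing file sizes, as in the examples above.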