Wednesday, March 26, 2008

Turn your electronic document into a user interface and collect information from readers.

Traditional paper forms use page layout to show how information is structured. Sometimes, as on tax forms, these relationships get pretty complicated. PDF preserves page layout, so it is a natural way to publish forms on the Web. The next decision is, how many PDF form features should you add?

If you add no features, your users must print the form and fill it out as they would any other paper form. Then they must mail it back to you for processing. Sometimes this is all you need, but PDF is capable of more.

If you add fillable form fields to the PDF, your users can fill in the form using Acrobat or Reader. When they are done, they still must print it out and mail it to you. Acrobat users can save filled-in PDFs, but Reader users can't, which can be frustrating.

If you add fillable form fields and a Submit button that posts field data to your web server, you have joined the information revolution. Your web server can interactively validate the user's data, provide helpful feedback, record the completed data in your database, and supply the user with a savable PDF copy. Olé!

We have gotten ahead of ourselves, though. First, let's create a form that submits data to your web server. Subsequent hacks will build on this. To see online examples of interactive PDF forms, visit http://www.pdfhacks.com/form_session/. You can download our example PDF forms and PHP code from this site, too.

How to Tally Topic Popularity

Organize PDF page hits by document headings to get a sense of what readers like best.

A single long document can cover dozens of topics. Which topics do readers find most useful? Use our PDF skins to track hits to individual pages. Then, use our hit_report script to map these page hits back into your document's headings. You'll see topic hits, not ambiguous page hits. Visit http://www.pdfhacks.com/eno/ for an example.

Page hit logging is built into the pdfskins_classic_php template. After unpacking this template, activate hit logging by editing script.include and setting $log_hits=true. You can do this at any time, before or after skinning your PDF. Page hits get logged into text files located in the same directory as the skinned PDF, so the web server must have write access to that directory.

If a page is named pg_0025.pdf, its hit log is named pg_0025.pdf.hits. Each hit adds one line to the file. Each line includes the IP address that requested the page, so you can identify unique visitors if you desire.
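
As a rough illustration of how such a log could be tallied, here is a minimal Python sketch. The exact line layout of the .hits files is not described above, so this assumes that each non-empty line is one hit and that the first whitespace-separated token on the line is the IP address. The function name summarize_hits and the sample lines are made up for the example.

```python
from collections import Counter


def summarize_hits(log_lines):
    """Return (total_hits, unique_visitors) for one page's hit log.

    Assumption: each non-empty line is one hit, and its first
    whitespace-separated token is the requesting IP address.
    """
    ips = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            ips[fields[0]] += 1
    return sum(ips.values()), len(ips)


# Made-up log lines standing in for pg_0025.pdf.hits
sample = [
    "192.0.2.10 2008-03-26 10:15:02",
    "192.0.2.44 2008-03-26 10:17:40",
    "192.0.2.10 2008-03-26 11:02:13",
]
total, unique = summarize_hits(sample)
print(total, unique)  # prints: 3 2
```

To run it against a real log, pass an open file instead of the sample list, since iterating over a file yields its lines.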

After skinning your PDF and making sure hit logging works, visit http://www.pdfhacks.com/skins/ and download hit_report.php-1.0.zip. Unpack this single PHP file and copy it onto your web server.

If your skinned PDF is located in the directory:

http://pdfhacks.com/eno/skinned_php/

pass its location to hit_report like so:

http://pdfhacks.com/hit_report.php?pdf=/eno/skinned_php

The document outline should appear in your browser, just as it does on the skinned PDF's title page. On the right side of the page, a column of numbers shows the number of hits on each outline topic.

For sections that span multiple pages, page hits are summed to create the section-hit count. However, section-hit counts do not include subsection-hit counts. If multiple sections have headings that appear on the same document page, those sections will also share the same hit count. hit_report identifies these by giving these sections the same background color.
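
The counting rules above can be sketched in a few lines of Python. This is not the hit_report source, only one reading of the rules: a heading owns the pages from its own start page up to the page before the next heading that starts on a later page, so a parent section does not absorb its subsections, and headings that start on the same page end up with the same total. The function name section_hits and its arguments are hypothetical.

```python
def section_hits(outline, page_hits, last_page):
    """Map each outline heading to a hit count.

    outline   -- list of (title, start_page) in document order
    page_hits -- dict of page number -> hits
    last_page -- number of the final page in the document

    Assumption: a heading owns the pages from its start page up to the
    page before the next heading (of any level) that starts later, so
    subsection pages are not counted toward the parent section.
    """
    counts = []
    for i, (title, start) in enumerate(outline):
        # Start page of the first later heading on a different page.
        next_start = next(
            (p for _, p in outline[i + 1:] if p > start), last_page + 1
        )
        total = sum(page_hits.get(p, 0) for p in range(start, next_start))
        counts.append((title, total))
    return counts


outline = [("Intro", 1), ("Setup", 3), ("Options", 3), ("Usage", 5)]
page_hits = {1: 10, 2: 4, 3: 7, 4: 1, 5: 2, 6: 3}
for title, hits in section_hits(outline, page_hits, last_page=6):
    print(title, hits)
```

In this example Setup and Options both start on page 3, so they report the same count, which matches the shared background color described above.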

Provide this hit information to your readers by merging hit_report features with the current skin templates index.html and index.toc.html.

Share PDF Comments Online (Even Without Acrobat)

Use our PDF skins to add commenting features to PDF pages.

Using Acrobat, you can add various comments and annotations to PDF pages. You can also share these comments via email or by configuring Acrobat's Online Comments. These collaboration tools require all contributors to have Acrobat; they do not work with Reader. And, in general, all contributors must have the same version of Acrobat.

Instead, add online commenting features to PDF pages with our PDF skins and a couple of PHP scripts. Users don't need Acrobat, so it works on Mac and Linux as well as Windows. You can also integrate PDF comments with your site's current commenting system. Our Comments skin will get you up and running. View our online example at http://www.pdfhacks.com/eno/skinned_comments/.

Skinning PDF, Adding Comments
Instead of using the template pdfskins_classic_php-1.0.zip, download pdfskins_classic_comments-1.0.zip. This Comments skin is the same as the php skin except it adds showannot.php and saveannot.php.

Skin a PDF with our comments template and move the results into a directory on your web server. Your server must have permission to write in this directory so that it can create and maintain comments. Point your web browser to this directory's URL and the commenting frame should be visible on the right. Enter a comment into the field and click Add Comment. Your comment should appear above.

Our commenting script saves page comments in text files. To reduce the chance of a file access collision, it copies the current comments to a temporary file before appending a new comment. When it is done, it replaces the original comments file with the updated temporary file. Even so, if two users submit comments simultaneously, they still might collide. Consider adapting the script so that it stores comments in a database instead of a text file.
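
The copy-then-replace idea can be sketched in Python as follows. This is an illustration of the technique, not a port of saveannot.php, and the function name add_comment is hypothetical. It writes the old comments plus the new one to a temporary file in the same directory and then swaps it into place with os.replace, so readers never see a half-written file.

```python
import os
import tempfile


def add_comment(comments_path, comment):
    """Append one comment by rewriting the file through a temporary copy."""
    directory = os.path.dirname(os.path.abspath(comments_path))
    existing = ""
    if os.path.exists(comments_path):
        with open(comments_path, "r", encoding="utf-8") as f:
            existing = f.read()

    # Write the old comments plus the new one to a temp file in the
    # same directory, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(existing)
            tmp.write(comment.rstrip("\n") + "\n")
        os.replace(tmp_path, comments_path)  # swap in the updated file
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on any failure
        raise
```

As the paragraph above warns, this does not remove the race between two simultaneous writers: both can read the same old contents, and the second swap then discards the first writer's comment. A file lock or a database closes that gap.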

Saturday, March 22, 2008

How to use PDF Skins as Copy Protection?

By bursting your PDF into pages and then not making the full document available for download, you compel readers to return to your site when they desire your material. If this is your intent, you should also secure your pages against merging, so nobody can easily reassemble your pages into the original PDF document. Do this when bursting the document. For example:

pdftk full_doc.pdf burst encrypt_128bits owner_pw 23@#5dfa \
    allow DegradedPrinting

Test our PHP-based hacks on your Windows machine by installing the Apache web server.

How to Change Colors and override the title?

You can add or change data in the doc_data.txt file, or you can pass additional, overriding data to pdfskins on the command line. This is most useful for changing the default colors used in the Classic skin. For example:

pdfskins full_doc.pdf -title "Great American Novel" -color1 "#336600" \
    -color2 white

In the Classic skin, color1 is the color of the header and color2 is the color around the upper-left logo. Alternatively, you can add or change these lines in doc_data.txt:

InfoKey: Color1
InfoValue: #336600
InfoKey: Color2
InfoValue: white
InfoKey: Title
InfoValue: Great American Novel
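
To check what a doc_data.txt file currently holds, the InfoKey/InfoValue pairs can be read back with a few lines of Python. This is a minimal sketch that assumes each InfoKey line is followed by its InfoValue line, as in the listing above. The function name read_info is hypothetical.

```python
def read_info(lines):
    """Collect InfoKey/InfoValue pairs from report lines into a dict."""
    info = {}
    key = None
    for line in lines:
        # Split on the first colon only, so values may contain colons.
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if name == "InfoKey":
            key = value
        elif name == "InfoValue" and key is not None:
            info[key] = value
            key = None
    return info


sample = [
    "InfoKey: Color1",
    "InfoValue: #336600",
    "InfoKey: Color2",
    "InfoValue: white",
    "InfoKey: Title",
    "InfoValue: Great American Novel",
]
print(read_info(sample))
```

Lines that are neither InfoKey nor InfoValue, such as bookmark entries, are ignored.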

How to Skin the PDF?

First, install pdftk. Next, visit http://www.pdfhacks.com/skins/ and download pdfskins-1.1.zip. Unzip, and move pdfskins.exe to a convenient location, such as C:\Windows\system32\. On other platforms, compile pdfskins from the included source code. Just cd pdfskins-1.1 and run make.

Download a skin template from http://www.pdfhacks.com/skins/. The template pdfskins_classic_js uses client-side JavaScript to create the dynamic pieces. pdfskins_classic_php uses server-side PHP instead. Pick one and unzip it into a new directory:

unzip pdfskins_classic_js-1.1.zip

Copy your PDF document into this new directory and burst it into pages with pdftk. This also creates doc_data.txt, which reports on the document's title, metadata, and bookmarks:

pdftk full_doc.pdf burst

Finally, in this same directory, spin skins using pdfskins. It reads doc_data.txt, created earlier, for the document title and other data. Pass the PDF filename as the first argument, if you plan to make the full PDF document available for download. This first argument is used only for constructing the Download Full Document hyperlink. It can be a full or relative URL. Omit this filename, and this hyperlink will not be displayed.

pdfskins full_doc.pdf

Fire up your web browser and point it at index.html, located in the directory where you've been working. The portal should appear, showing the table of contents and graphic placeholders for your logo (logo.gif) and document cover thumbnail (thumb.gif). If you used the php or comments templates, the pages must be served to you by a PHP-enabled web server.

The PDF pages that make up our skinned PDF do not need to be linearized, nor does the web server require byte serving configuration. The only requirement is that the user has Adobe Reader configured to display PDF inside the browser, which is the default Reader configuration.


Tuesday, March 18, 2008

Split a PDF into pages and frame them in HTML, where the fun begins

In general, HTML files are called pages, while PDF files are called documents. By splitting a PDF document into PDF pages we shift it into HTML's paradigm where we now can program the document like a web site. Let's start with a basic document skin, which gives us a cool look and handy document navigation.

Classic skin has a number of nice built-in features:
  • Table of contents portal page based on PDF bookmarks
  • Navigation cluster for flipping through pages
  • Table of Contents navigation sidebar based on PDF bookmarks
  • A hyperlink to the full, unsplit PDF for download on each page
  • Convenient Email This Page link on each page
Test-drive our online version at http://www.pdfhacks.com/eno/. The HTML, JavaScript, and user interface icons are freely distributable under the GPL, so feel free to use them in your own templates.

How to create a PDF Table of Contents in HTML with pdftk and pdftoc?

First, download and install pdftk.
pdftk can report on PDF data, including bookmarks. pdftoc converts this plain-text report into HTML. Visit http://www.pdfhacks.com/pdftoc/ and download pdftoc-1.0.zip. Unzip, and move pdftoc.exe to a convenient location, such as C:\Windows\system32\. On other platforms, build pdftoc from the source code.

Use pdftk to grab the bookmark data from your PDF, like so:

pdftk mydoc.pdf dump_data output mydoc_data.txt

Next, use pdftoc to convert this plain-text report into HTML:

pdftoc mydoc.pdf < mydoc_data.txt > mydoc_toc.html

Alternatively, you can run these two steps together, like so:

pdftk mydoc.pdf dump_data | pdftoc mydoc.pdf > mydoc_toc.html

The first argument to pdftoc is the document location that you want pdftoc to use in its hyperlinks. The previous example assumes that mydoc.pdf and mydoc_toc.html will be in the same directory. You can also give a relative path to your PDF, like so:

pdftoc ../pdf/mydoc.pdf < mydoc_data.txt > mydoc_toc.html

or a full URL:

pdftoc http://pdfhacks.com/pdf/mydoc.pdf < mydoc_data.txt > mydoc_toc.html
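
To make the conversion less of a black box, here is a small Python sketch of the same idea. It is not the pdftoc source. It assumes the report contains BookmarkTitle, BookmarkLevel, and BookmarkPageNumber lines in that order for each bookmark, and that titles arrive as plain text (if your report already encodes special characters as HTML entities, skip the escaping of the title). The function name toc_html is hypothetical.

```python
import html


def toc_html(report_lines, pdf_url):
    """Build an indented HTML list of page links from bookmark report lines."""
    items = []
    title, level = None, 1
    for line in report_lines:
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if name == "BookmarkTitle":
            title = value
        elif name == "BookmarkLevel":
            level = int(value)
        elif name == "BookmarkPageNumber" and title is not None:
            indent = 2 * (level - 1)  # left margin in em per outline level
            items.append(
                '<div style="margin-left:%dem">'
                '<a href="%s#page=%s">%s</a></div>'
                % (indent, html.escape(pdf_url), int(value), html.escape(title))
            )
            title, level = None, 1
    return "\n".join(items)


report = [
    "BookmarkTitle: Intro",
    "BookmarkLevel: 1",
    "BookmarkPageNumber: 1",
    "BookmarkTitle: Fonts & Color",
    "BookmarkLevel: 2",
    "BookmarkPageNumber: 17",
]
print(toc_html(report, "mydoc.pdf"))
```

Each link ends in #page=N, so clicking a heading opens the PDF at that heading's page.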

Once readers enter the PDF, they can use its bookmarks for further navigation. To ensure they see your bookmarks, set your PDF to display them upon opening.

You can also add a download link on the web page that prompts the user to save the PDF on her local disk. As a courtesy to the user, mention the download file size, too.

Give web surfers an inviting HTML gateway into your PDF

When browsing the Web, I usually groan at the sight of a PDF link. You have probably experienced it, too. My research has brought me to this point where I must now download a large PDF before I can proceed. The problem isn't so much with the PDF file, but with my inability to gauge just how much this PDF might help me before I commit to downloading it.

The PDF author might have even gone to great lengths to ensure a good, online read, with nice, clear fonts, navigational bookmarks, and page-at-a-time byte serving for quick, random access. But I can't tell that from looking at this PDF link. Chances are that I'll click and wait, and wait. When it finally opens, I'll probably need to flip, page by page, through illegible text looking for a clue that this tome will help me somehow. I might never find out, especially because I have a dozen other possible lines of inquiry I am pursuing at the same time.

Don't let this happen to your online PDF. If your PDF has bookmarks, use this hack to create an HTML table of contents that hyperlinks every heading directly to its PDF page.

Thursday, March 13, 2008

How to Customize HTML Hyperlinks to PDF Pages?

Take readers directly to the information they seek.

You can use HTML hyperlinks, those famous filaments of the Web, to integrate PDF documents with HTML documents. A simple link to a PDF document is not enough, though, because a single PDF might hold hundreds of pages. It is like handing a haystack to somebody searching for a needle. The solution is to modify the HTML link so that it takes the reader directly to the PDF page of interest. This kind of seamless integration of HTML and PDF pages requires some groundwork.

To tailor a hyperlink's PDF destination, just add one or more of the suffixes listed below to the href path.

Suffix               Effect
page=N               Open the PDF to page number N (the first page is 1)
pagemode=bookmarks   Display PDF bookmarks
pagemode=thumbs      Display PDF thumbnails
pagemode=none        Conceal PDF bookmarks and thumbnails
scrollbar=false      Conceal the Acrobat scrollbars
toolbar=false        Conceal the Acrobat toolbar

These are glued together and appended to the href path using a special notation. The first suffix follows a hash mark. Each additional suffix follows an ampersand. These options are fully documented in PDF Open Parameters, located at http://partners.adobe.com/asn/acrobat/sdk/public/docs/PDFOpenParams.pdf.

For example, to open mydoc.pdf to page 17 and display its document bookmarks, the hyperlink href would look like this:

http://pdfvault.com/mydoc.pdf#page=17&pagemode=bookmarks
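
If you generate many such links, the notation is easy to build in code. Here is a minimal Python sketch. The helper name pdf_link is made up for this example, and it simply joins whatever name=value pairs you pass without checking them against Adobe's list.

```python
def pdf_link(url, **params):
    """Append PDF open parameters to a URL: '#' first, then '&' between."""
    if not params:
        return url
    suffixes = "&".join("%s=%s" % (name, value) for name, value in params.items())
    return url + "#" + suffixes


print(pdf_link("http://pdfvault.com/mydoc.pdf", page=17, pagemode="bookmarks"))
# prints: http://pdfvault.com/mydoc.pdf#page=17&pagemode=bookmarks
```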


These special PDF hyperlinks do not work when you're using Internet Explorer and the PDF is on your local disk.

Save Display Settings in the PDF
You can also save these display settings in the PDF file. Whenever and however the PDF is opened, it will be displayed according to your settings.

How to Serve PDF Downloads with a PHP Script?

This next script enables you to serve PDF downloads. It is handy for when you want to make a single PDF available for both online reading and downloading. You can use its technique of using the Content-Type and Content-Disposition headers in any script that serves download-only PDF.

Download the script from http://ifile.it/5ul9ewn.

If you have a PDF located at http://www.pdfvault.com/docs/mydoc.pdf and you copied the preceding script to http://www.pdfvault.com/docs/pdfdownload.php, the URL http://www.pdfvault.com/docs/pdfdownload.php?fn=mydoc.pdf would prompt users to download mydoc.pdf to their computers.
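
The header technique itself is small enough to sketch in Python. This is not the PHP script linked above. It only shows the two headers doing the work, plus a precaution worth taking in any script that accepts a filename from the query string: keep only the last path component so a request cannot reach outside the docs directory. The function name download_headers is hypothetical.

```python
import os


def download_headers(fn, size):
    """Return HTTP headers that prompt a Save As dialog for a PDF."""
    # Keep only the final path component so a request such as
    # fn=../../secret.pdf cannot escape the docs directory.
    safe_name = os.path.basename(fn.replace("\\", "/"))
    if not safe_name.lower().endswith(".pdf") or '"' in safe_name:
        raise ValueError("only plainly named PDF files are served")
    return [
        ("Content-Type", "application/pdf"),
        ("Content-Disposition", 'attachment; filename="%s"' % safe_name),
        ("Content-Length", str(size)),
    ]


for name, value in download_headers("mydoc.pdf", 48213):
    print("%s: %s" % (name, value))
```

The attachment disposition is what asks the browser to save the file instead of displaying it inline.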

How to Create Download-Only Folders Using .htaccess Files?

Do you have an entire directory of download-only PDFs on your web server? You can change that directory's .htaccess file so that visitors are always prompted to download their PDFs. The trick is to send suitable Content-Type and Content-Disposition HTTP headers to the clients.

This works on Apache and Zeus web servers that have their .htaccess features enabled. In your PDF directory, add a file named .htaccess that has these lines:
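
A minimal sketch of such a file, assuming Apache with mod_headers enabled and an AllowOverride setting that permits these directives. These are one common way to send the two headers and are not necessarily the exact lines the author used. Zeus may need different syntax.

```apache
# Applies to every file in this directory whose name ends in .pdf
<FilesMatch "\.pdf$">
    # Send a generic binary type so the browser does not open the PDF inline
    ForceType application/octet-stream
    # Ask the browser to show its Save As dialog
    Header set Content-Disposition attachment
</FilesMatch>
```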




Sunday, March 9, 2008

Using Zip to prevent online PDF reading

Prevent your online PDF from appearing inside the browser.

Some PDF documents on the Web are intended for online reading, but most are intended for download and then offline reading or printing. You can prevent confusion by ensuring your readers get the Save As . . . dialog when they click your Download Now PDF link. Here are a few ways to do this.

Keep in mind that any online PDF can be downloaded. If your online PDF is hyperlinked to integrate with your web site, you should take precautions against these links being broken upon download.

One option is to use only absolute URLs throughout your PDF.

Another option is to set the Base URL of your PDF. In Acrobat 6, consult File >Document Properties >Advanced >Base URL. In Acrobat 5, consult File >Document Properties >Base URL.

Zip It Up
The quickest solution for a single PDF is to compress it into a zip file, which gives you a file that simply cannot be read online. This has the added benefit of reducing the download file size a little. The downside is that your readers must have a program to unzip the file. You should include a hyperlink to where they can download such a program (e.g., http://www.info-zip.org/pub/infozip/). Stay away from self-extracting executables, because they work on only a single platform.

You can also apply zip compression on the fly with your web server. Here is an example in PHP, which you can download from http://ifile.it/8njgeb2. Adjust the passthru argument so that it points to your local copy of zip.

If you have a PDF located at http://www.pdfhacks.com/docs/mydoc.pdf and you copied the preceding script to http://www.pdfhacks.com/docs/pdfzip.php, you could serve mydoc.pdf.zip with the URL http://www.pdfhacks.com/docs/pdfzip.php?fn=mydoc.pdf.
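
For comparison, here is the same idea as a minimal Python sketch. Unlike the PHP script, which hands the work to an external zip program through passthru, this version uses Python's standard zipfile module and builds the archive in memory. The function name zip_pdf is hypothetical.

```python
import io
import zipfile


def zip_pdf(pdf_name, pdf_bytes):
    """Return (download_name, zip_bytes) for a single PDF, built in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(pdf_name, pdf_bytes)
    return pdf_name + ".zip", buffer.getvalue()


name, data = zip_pdf("mydoc.pdf", b"%PDF-1.4 example bytes")
print(name, len(data))
```

A web handler would send the returned bytes with a Content-Type of application/zip and the returned name in its Content-Disposition header.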

Preparing Server for PDF Online reading

Both Apache (version 1.3.17 and later) and Microsoft IIS (version 3 and later) should serve PDF pages on demand without additional configuration. The key to serving PDF pages on demand is byte range support by the web server. HTTP/1.1 describes byte range support (http://www.freesoft.org/CIE/RFC/2068/160.htm). Byte range support means that the client can request a specific range of bytes from the web server. Instead of serving the entire file, the server will send just those bytes.

The web server must indicate its support for byte ranges by sending the "Accept-Ranges: bytes" header in response to a PDF file request. Otherwise, Acrobat might not attempt page-at-a-time downloading. If you want to tell clients to not attempt page-at-a-time serving from your server, send the "Accept-Ranges: none" header instead.
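
To make the exchange concrete, here is a simplified Python sketch of what a server does with a Range request header. It is an illustration only. It handles a single range of the form bytes=start-end or bytes=start-, ignores anything else by sending the whole file, and leaves out multi-range and suffix requests that a real server also supports. The function name serve_range is hypothetical.

```python
import re

# One range with an explicit start, e.g. "bytes=0-1023" or "bytes=4096-"
_SINGLE_RANGE = re.compile(r"bytes=(\d+)-(\d*)", re.ASCII)


def serve_range(data, range_header=None):
    """Return (status, headers, body) for a whole or partial response."""
    headers = [("Accept-Ranges", "bytes")]  # advertise byte range support
    size = len(data)
    match = _SINGLE_RANGE.fullmatch(range_header or "")
    if match:
        start = int(match.group(1))
        end = min(int(match.group(2)), size - 1) if match.group(2) else size - 1
        if start >= size:
            # The requested range begins past the end of the file.
            headers.append(("Content-Range", "bytes */%d" % size))
            return 416, headers, b""
        if start <= end:
            headers.append(("Content-Range", "bytes %d-%d/%d" % (start, end, size)))
            return 206, headers, data[start:end + 1]
    # No Range header, or one this sketch does not handle: send everything.
    return 200, headers, data


status, headers, body = serve_range(b"0123456789", "bytes=2-5")
print(status, body)
```

A 206 status with a Content-Range header tells the client it received only the slice it asked for, which is what lets Acrobat fetch one page's worth of data at a time.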

Preparing the PDF for Online Reading

Serve PDF pages on demand and spare readers a long download.

Sometimes readers want to download the entire document; sometimes they want to read just a few pages. If a reader desires to read a single page from your PDF, she shouldn't be stuck downloading the entire document. A large document download will turn her away. The easiest solution is to configure your PDF and your web server for serving individual pages on request. An alternative is to use our PDF skins.

Prepare the PDF
To permit page-at-a-time delivery over the Web, a PDF must be linearized. Linearization organizes a PDF's internal structure so that a client can request the PDF resources it needs on a byte-by-byte basis. If the reader wants to see page 12, then the client requests only the data it needs to display page 12.

Test whether a PDF is linearized by opening it in Acrobat/Reader and viewing its document properties. Open File >Document Properties . . . >Description (Acrobat 6) or File >Document Properties >Summary (Acrobat 5). A linearized PDF shows Fast Web View: Yes.

The Xpdf project (http://www.foolabs.com/xpdf/) includes a command-line tool called pdfinfo that can tell you if a PDF is linearized. Pass your PDF to pdfinfo like so:

pdfinfo mydoc.pdf

pdfinfo will create a text report on-screen that says Optimized: Yes if your PDF is linearized. pdfinfo is free software.
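
If neither tool is at hand, a quick heuristic can be scripted. A linearized PDF begins with a linearization dictionary carrying a /Linearized key, which the PDF specification places within the first 1024 bytes of the file. The Python sketch below only looks for that key. It does not validate the rest of the linearization data, and a file that was edited and saved incrementally afterward may still pass, so treat a True result as a hint. The function name looks_linearized is hypothetical.

```python
def looks_linearized(pdf_bytes):
    """Heuristic: a linearized PDF has a /Linearized key near its start."""
    head = pdf_bytes[:1024]
    return head.startswith(b"%PDF-") and b"/Linearized" in head


# Made-up file beginnings for illustration
fast = b"%PDF-1.4\n1 0 obj\n<< /Linearized 1 /L 51234 /N 12 >>\nendobj\n"
slow = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
print(looks_linearized(fast), looks_linearized(slow))  # prints: True False
```

To test a real file, read its first kilobyte in binary mode and pass those bytes to the function.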

To create a linearized PDF using Acrobat, first inspect your preferences. Select Edit >Preferences >General . . . and choose the General category (Acrobat 6) or the Options category (Acrobat 5). Place a checkmark next to Save As Optimizes for Fast Web View and click OK.

Open the PDF you want to linearize and then Save As... to the same filename. In Acrobat 6, you can change the PDF's compatibility level at the same time by selecting File >Reduce File Size instead of Save As.... Open the document properties to check that it worked.

If you ever make changes to the PDF in Acrobat and then simply use File >Save, your PDF will no longer be linearized. You must use Save As... to ensure that your PDF remains linearized.

Ghostscript includes a command-line tool called pdfopt that can linearize PDF. To create a linearized PDF using pdfopt, invoke it from the command-line like so:

pdfopt input.pdf output.linearized.pdf

Tuesday, March 4, 2008

How to Chain the PDF to the User's Machine?

Digital Rights Management (DRM) tools give you fine-grained control over how and when the reader can use your document. Typically, a reader downloads the full PDF, but he can't read it until he purchases a key. After he makes the purchase, a key is created that can open that PDF only on that computer. Some readers find this model too restricting.

DRM software vendors include Adobe (PDF Merchant), FileOpen Systems, Authentica, and SoftSeal. Their tools tend to be too expensive for the casual user. Consider partnering with a distributor or a self-publishing service.

How to make a PDF Online Reading Only?

Another idea is to prevent the reader from ever downloading your PDF. A single PDF can always be downloaded. So, burst your document into individual PDF pages and then wrap them in our HTML skins. When you burst the PDF, supply additional security settings for the output pages so that the reader won't be able to easily reassemble them. Make sure you have installed pdftk first. For example:

pdftk doc.pdf burst encrypt_128bits owner_pw 23@#5dfa allow DegradedPrinting

After integrating your document into your web site, you can employ user accounts, passwords, and other common security devices for enforcing access permissions.

Skinned PDFs are vulnerable to being copied from your site using a recursive HTTP robot. The result would be an exact copy of your site's pages (PDF and HTML) on the user's local machine.

Low Tech PDF Copy Protection: Print Editions

Control how far your document can wander by making it difficult to copy.

A large document represents a great deal of work, and PDF is a good way to distribute large documents. Sometimes, it is too good. Perhaps your readers are paying customers, and you don't want them to make copies for their friends. Perhaps you want people to read your work only from your web site, not from a downloaded copy. These kinds of controls go beyond standard PDF security.

Copying and sharing print editions of your document would be too much trouble for most readers. Your price for this security is the cost and trouble of production and shipping. However, readers might prefer a print edition, in which case you are also adding value to your work. Print editions are vulnerable to being converted to unsecured PDF by scanning and OCR.