Thursday, January 24, 2008

The Procedure

First, set your PDF's logical page numbering [Hack #62] to match your document's page numbering. Then, use pdftk to dump this information into a text file, like so:

pdftk mydoc.pdf dump_data output mydoc.data.txt
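
Before moving on, it can help to confirm that the dump actually picked up your logical page numbering. The following is a minimal sketch, assuming pdftk writes one line beginning with PageLabelNewIndex: for each page-label record in mydoc.data.txt. The helper name count_page_labels is hypothetical and not part of pdftk.

```shell
# Count the page-label records in a pdftk dump_data file.
# Assumption: each record contains exactly one "PageLabelNewIndex:" line.
count_page_labels() {
    grep -c '^PageLabelNewIndex:' "$1"
}
```

Run it as count_page_labels mydoc.data.txt. A result of 0 suggests the logical page numbering was never set, in which case the index would likely not carry your document's own page numbers.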

Next, convert your PDF to plain text with pdftotext:

pdftotext mydoc.pdf mydoc.txt

Create a keyword list from mydoc.txt using kw_catcher, like so:

kw_catcher 12 keywords_only mydoc.txt > mydoc.kw.txt

Edit mydoc.kw.txt by hand to remove duds and add any keywords that were missed. At present, only one keyword is allowed per line. When two or more keywords appear next to each other in mydoc.txt, our page_refs program assembles them into phrases.
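
Since only one keyword is allowed per line, a quick check of the edited list can catch slips before page_refs runs. This is a rough sketch using plain awk. The function names find_multiword_lines and dedupe_keywords are made up for illustration, and the sketch assumes keywords are separated from each other only by line breaks.

```shell
# Print, with line numbers, every line that holds more than one
# whitespace-separated word (a violation of one keyword per line).
find_multiword_lines() {
    awk 'NF > 1 { print NR ": " $0 }' "$1"
}

# Print the list without blank lines or exact duplicate lines,
# keeping the first occurrence of each line in its original order.
dedupe_keywords() {
    awk 'NF > 0 && !seen[$0]++' "$1"
}
```

For example, find_multiword_lines mydoc.kw.txt lists the lines to fix by hand, and dedupe_keywords mydoc.kw.txt > mydoc.kw.clean.txt writes a tidied copy (the output filename is only an example).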

Now pull all these together to create a text index using page_refs:

page_refs mydoc.txt mydoc.kw.txt mydoc.data.txt > mydoc.index.txt

Finally, create a PDF from mydoc.index.txt using enscript and ps2pdf:

enscript --columns 2 --font 'Times-Roman@10' \
--header '|INDEX' --header-font 'Times-Bold@14' \
--margins 54:54:36:54 --word-wrap --output - mydoc.index.txt \
| ps2pdf - mydoc.index.pdf
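
Once the keyword list has been edited, the remaining steps can be rerun in one go whenever the document changes. The function below is a sketch rather than part of the original hack. The name build_index is hypothetical, the commands and arguments are the ones shown above, and it assumes pdftk, pdftotext, page_refs, enscript, and ps2pdf are all on your PATH. It deliberately skips kw_catcher so that a hand-edited keyword list is not overwritten.

```shell
# Rebuild BASE.index.pdf from BASE.pdf and an already-edited BASE.kw.txt.
# Usage: build_index mydoc
build_index() {
    base=$1
    # Stop early if the hand-edited keyword list is missing.
    [ -f "$base.kw.txt" ] || { echo "missing $base.kw.txt" >&2; return 1; }
    # Each step runs only if the previous one succeeded.
    pdftk "$base.pdf" dump_data output "$base.data.txt" &&
    pdftotext "$base.pdf" "$base.txt" &&
    page_refs "$base.txt" "$base.kw.txt" "$base.data.txt" > "$base.index.txt" &&
    enscript --columns 2 --font 'Times-Roman@10' \
        --header '|INDEX' --header-font 'Times-Bold@14' \
        --margins 54:54:36:54 --word-wrap --output - "$base.index.txt" |
    ps2pdf - "$base.index.pdf"
}
```

Calling build_index mydoc reproduces the file names used in this procedure: mydoc.data.txt, mydoc.txt, mydoc.index.txt, and mydoc.index.pdf.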
