First, set your PDF's logical page numbering [Hack #62] to match your document's page numbering. Then, use pdftk to dump this information into a text file, like so:
pdftk mydoc.pdf dump_data output mydoc.data.txt
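Before going further, it can be worth confirming that the dump actually contains your page-label records, since these carry the logical page numbering. The following is a minimal sketch, assuming the label keys in the dump start with PageLabel (for example PageLabelNewIndex and PageLabelStart); the function name show_page_labels is hypothetical and not part of pdftk.

```shell
#!/bin/sh
# Hypothetical helper: list the page-label records in a pdftk data dump.
# Assumption: label keys begin with "PageLabel" in the dump_data output.
show_page_labels() {
    # grep exits non-zero when no label records are present.
    grep '^PageLabel' "$1"
}
```

A call such as show_page_labels mydoc.data.txt || echo "no page labels found" prints the records, or the warning if the PDF has no logical page numbering set.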
Next, convert your PDF to plain text with pdftotext:
pdftotext mydoc.pdf mydoc.txt
Create a keyword list from mydoc.txt using kw_catcher, like so:
kw_catcher 12 keywords_only mydoc.txt > mydoc.kw.txt
Edit mydoc.kw.txt to remove duds and add missing keywords. At present, only one keyword is allowed per line. If two or more keywords are adjacent in mydoc.txt, our page_refs program will assemble them into phrases.
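Because only one keyword is allowed per line, a quick check of the edited list can catch stray multi-word or blank lines before you build the index. This is a rough sketch using standard awk; the function name check_keywords is hypothetical and not part of kw_catcher or page_refs.

```shell
#!/bin/sh
# Hypothetical helper: report lines in a keyword file that hold
# anything other than exactly one word (blank lines included).
check_keywords() {
    # NF is awk's count of whitespace-separated fields on the line.
    awk 'NF != 1 { printf "line %d: %s\n", NR, $0; bad = 1 }
         END { exit bad + 0 }' "$1"
}
```

Running check_keywords mydoc.kw.txt prints each offending line with its line number and exits non-zero if it found any.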
Now pull all of these together to create a text index using page_refs:
page_refs mydoc.txt mydoc.kw.txt mydoc.data.txt > mydoc.index.txt
Finally, create a PDF from mydoc.index.txt using enscript and ps2pdf:
enscript --columns 2 --font 'Times-Roman@10' \
--header '|INDEX' --header-font 'Times-Bold@14' \
--margins 54:54:36:54 --word-wrap --output - mydoc.index.txt \
| ps2pdf - mydoc.index.pdf
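The steps above can be collected into one shell function. This is only a sketch: it assumes pdftk, pdftotext, kw_catcher, page_refs, enscript, and ps2pdf are all on your PATH, and the function name build_index and its two-pass behavior are illustrative choices rather than part of any of these tools. On the first run it stops after writing the keyword list so that you can edit it by hand; on later runs it keeps your edited list and builds the index.

```shell
#!/bin/sh
# Hypothetical wrapper: build_index BASENAME
# (for mydoc.pdf, call: build_index mydoc)
build_index() {
    base=$1

    # Dump page-label data and extract the plain text.
    pdftk "$base.pdf" dump_data output "$base.data.txt" || return 1
    pdftotext "$base.pdf" "$base.txt" || return 1

    # Generate the keyword list only once so hand edits survive reruns.
    if [ ! -f "$base.kw.txt" ]; then
        kw_catcher 12 keywords_only "$base.txt" > "$base.kw.txt" \
            || { rm -f "$base.kw.txt"; return 1; }
        echo "Edit $base.kw.txt, then run again." >&2
        return 0
    fi

    # Build the text index, then typeset it as a two-column PDF.
    page_refs "$base.txt" "$base.kw.txt" "$base.data.txt" \
        > "$base.index.txt" || return 1
    enscript --columns 2 --font 'Times-Roman@10' \
        --header '|INDEX' --header-font 'Times-Bold@14' \
        --margins 54:54:36:54 --word-wrap --output - "$base.index.txt" \
        | ps2pdf - "$base.index.pdf"
}
```

Skipping keyword generation when the list already exists is a deliberate choice here: rerunning kw_catcher would overwrite your hand-edited list.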