Monday, November 26, 2007

How to extract keywords from your PDF documents

Knowing the keywords in your PDF documents helps you write excerpts and descriptions for them. With a large PDF document, it can be hard to decide which keywords describe it best.
Follow these steps:

1. Download xpdf. It includes pdftotext, which converts your PDF into a text file and makes the keyword analysis easier.

2. Unzip the download and copy pdftotext.exe to any folder you want. Macintosh OS X users can download a pdftotext installer here.

3. Run pdftotext from the command line. The command format of pdftotext is the following:

pdftotext input.pdf output.txt

4. Download the keyword extractor here. The downloaded file is 'kw_index-1.0.zip'.

5. Extract the file and use the following command format:
kw_catcher {window size} {report style} {text input filename}


Window size = a value that is too small will make you miss keywords; a value that is too large will produce a lot of noise words (and, if, or, ...). Try 12 as the default and adjust accordingly.
Report style = fill in with 'frequency'.
Text input filename = the input file, i.e. the converted text file.
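
To get a feel for what a frequency report and "noise words" mean, here is a rough Python sketch that counts words in a text and drops common filler words. This is not how kw_catcher works internally (its algorithm is not described here); it is a plain word count used as a stand-in. The function name `top_keywords`, the `NOISE_WORDS` list, and the `limit` and `min_length` parameters are all made up for this illustration.

```python
import re
from collections import Counter

# Small illustrative list of noise words (an assumption, not kw_catcher's own list).
NOISE_WORDS = {
    "a", "an", "and", "as", "at", "by", "for", "if", "in",
    "is", "it", "of", "on", "or", "the", "to", "with",
}


def top_keywords(text, limit=10, min_length=3):
    """Return the most frequent non-noise words as (word, count) pairs."""
    # Lowercase the text and keep only runs of the letters a-z as words.
    words = re.findall(r"[a-z]+", text.lower())
    # Drop noise words and very short words before counting.
    candidates = [w for w in words if w not in NOISE_WORDS and len(w) >= min_length]
    return Counter(candidates).most_common(limit)
```

To try it on the converted file, read the text first, for example `top_keywords(open("sample.txt", errors="replace").read())`. A longer noise word list gives a cleaner result, at the risk of hiding a word you actually care about.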

For example,
kw_catcher 12 frequency sample.txt
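
If you have many documents, the two commands above can be chained from a script. The following is only a sketch in Python, assuming that both pdftotext and kw_catcher can be found on your PATH and that kw_catcher writes its report to the console. The function names are hypothetical; the arguments simply mirror the command formats shown in the steps.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path):
    # Mirrors: pdftotext input.pdf output.txt
    return ["pdftotext", pdf_path, txt_path]


def kw_catcher_command(txt_path, window_size=12, report_style="frequency"):
    # Mirrors: kw_catcher {window size} {report style} {text input filename}
    return ["kw_catcher", str(window_size), report_style, txt_path]


def extract_keywords(pdf_path, txt_path, window_size=12):
    """Convert the PDF to text, then run kw_catcher on the text file."""
    # check=True raises CalledProcessError if a tool exits with an error.
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    subprocess.run(kw_catcher_command(txt_path, window_size), check=True)
```

For example, `extract_keywords("input.pdf", "sample.txt")` runs both steps with the default window size of 12. Pass a different `window_size` to compare results.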

2 comments:

Document Security said...

PDF file is a self-contained cross-platform document. It is a common format to which scanned documents are saved. These are meant to be accurate versions of source material, viewable across different operating systems and program packages. Thanks a lot...

Anonymous said...

Pretty useless. .. dont trust this post