com.lowagie.text.pdf.parser
Class PdfTextExtractor

java.lang.Object
  extended by com.lowagie.text.pdf.parser.PdfTextExtractor

public class PdfTextExtractor
extends java.lang.Object

Extracts text from a PDF file.

Since:
2.1.4

Field Summary
private  SimpleTextExtractingPdfContentStreamProcessor extractionProcessor
          The processor that will extract the text.
private  PdfReader reader
          The PdfReader that holds the PDF file.
 
Constructor Summary
PdfTextExtractor(PdfReader reader)
          Creates a new Text Extractor object.
 
Method Summary
private  byte[] getContentBytesForPage(int pageNum)
          Gets the content stream of a page.
 java.lang.String getTextFromPage(int page)
          Gets the text from a page.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

reader

private final PdfReader reader
The PdfReader that holds the PDF file.


extractionProcessor

private final SimpleTextExtractingPdfContentStreamProcessor extractionProcessor
The processor that will extract the text.

Constructor Detail

PdfTextExtractor

public PdfTextExtractor(PdfReader reader)
Creates a new Text Extractor object.

Parameters:
reader - the PdfReader that holds the PDF file to extract text from

Method Detail

getContentBytesForPage

private byte[] getContentBytesForPage(int pageNum)
                               throws java.io.IOException
Gets the content stream of a page.

Parameters:
pageNum - the number of the page whose content stream you want to get
Returns:
a byte array with the content stream of a page
Throws:
java.io.IOException

getTextFromPage

public java.lang.String getTextFromPage(int page)
                                 throws java.io.IOException
Gets the text from a page.

Parameters:
page - the number of the page to extract text from
Returns:
a String with the content as plain text (without PDF syntax)
Throws:
java.io.IOException
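
Because getTextFromPage works on one page at a time, extracting a whole document means looping over its pages. The following is a minimal sketch of that loop, not part of this class. AllPagesText, PageText, and extractAll are hypothetical names introduced for illustration, and the sketch assumes pages are numbered from 1, as is usual for PdfReader. The page source is abstracted behind a small interface so the example compiles without iText on the classpath.

```java
import java.io.IOException;

class AllPagesText {

    /** Hypothetical stand-in for PdfTextExtractor.getTextFromPage(int). */
    interface PageText {
        String get(int page) throws IOException;
    }

    /**
     * Concatenates the text of pages 1..pageCount, one page per line break.
     * Assumes 1-based page numbers; any IOException from the source is
     * passed on to the caller unchanged.
     */
    static String extractAll(int pageCount, PageText source) throws IOException {
        StringBuilder text = new StringBuilder();
        for (int page = 1; page <= pageCount; page++) {
            text.append(source.get(page)).append('\n');
        }
        return text.toString();
    }
}
```

With iText available and a Java 8 or later compiler, a call might look like extractAll(reader.getNumberOfPages(), extractor::getTextFromPage), where reader is the PdfReader and extractor is a PdfTextExtractor built from it.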
