Package org.apache.pdfbox.text
Class PDFMarkedContentExtractor
- java.lang.Object
-
- org.apache.pdfbox.contentstream.PDFStreamEngine
-
- org.apache.pdfbox.text.LegacyPDFStreamEngine
-
- org.apache.pdfbox.text.PDFMarkedContentExtractor
-
public class PDFMarkedContentExtractor extends LegacyPDFStreamEngine
This is a stream engine to extract the marked content of a PDF.
-
-
Field Summary
Modifier and Type    Field    Description
private java.util.Map<java.lang.String,java.util.List<TextPosition>>
characterListMapping
private java.util.Deque<PDMarkedContent>
currentMarkedContents
private java.util.List<PDMarkedContent>
markedContents
private boolean
suppressDuplicateOverlappingText
-
Constructor Summary
Constructor    Description
PDFMarkedContentExtractor()
Instantiate a new PDFMarkedContentExtractor object.
PDFMarkedContentExtractor(java.lang.String encoding)
Constructor.
-
Method Summary
Modifier and Type    Method    Description
void
beginMarkedContentSequence(COSName tag, COSDictionary properties)
Called when a marked content group begins.
void
endMarkedContentSequence()
Called when a marked content group ends.
java.util.List<PDMarkedContent>
getMarkedContents()
protected void
processTextPosition(TextPosition text)
This will process a TextPosition object and add the text to the list of characters on a page.
private boolean
within(float first, float second, float variance)
This will determine if two floating point numbers are within a specified variance.
void
xobject(PDXObject xobject)
-
Methods inherited from class org.apache.pdfbox.text.LegacyPDFStreamEngine
computeFontHeight, processPage, showGlyph
-
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, applyTextAdjustment, beginText, decreaseLevel, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator
-
-
-
-
Field Detail
-
suppressDuplicateOverlappingText
private final boolean suppressDuplicateOverlappingText
- See Also:
- Constant Field Values
-
markedContents
private final java.util.List<PDMarkedContent> markedContents
-
currentMarkedContents
private final java.util.Deque<PDMarkedContent> currentMarkedContents
-
characterListMapping
private final java.util.Map<java.lang.String,java.util.List<TextPosition>> characterListMapping
-
-
Constructor Detail
-
PDFMarkedContentExtractor
public PDFMarkedContentExtractor() throws java.io.IOException
Instantiate a new PDFMarkedContentExtractor object.
- Throws:
java.io.IOException
-
PDFMarkedContentExtractor
public PDFMarkedContentExtractor(java.lang.String encoding) throws java.io.IOException
Constructor. Will apply encoding-specific conversions to the output text.
- Parameters:
encoding
- The encoding that the output will be written in.
- Throws:
java.io.IOException
-
-
Method Detail
-
within
private boolean within(float first, float second, float variance)
This will determine if two floating point numbers are within a specified variance.
- Parameters:
first
- The first number to compare to.
second
- The second number to compare to.
variance
- The allowed variance.
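The variance check described above can be illustrated with a minimal sketch. The class name ToleranceCheck is hypothetical, and the sketch assumes the comparison is a strict open interval around the first value; the private method in this class may treat the boundaries differently.

```java
// Hypothetical illustration; not part of PDFBox.
final class ToleranceCheck {

    // Assumption: "within" means second lies strictly inside
    // the open interval (first - variance, first + variance).
    static boolean within(float first, float second, float variance) {
        return second > first - variance && second < first + variance;
    }

    public static void main(String[] args) {
        System.out.println(within(10.0f, 10.25f, 0.5f)); // prints true
        System.out.println(within(10.0f, 11.0f, 0.5f));  // prints false
    }
}
```

A tolerance comparison of this kind is a common way to decide whether two glyph coordinates are close enough to be treated as the same position.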
-
beginMarkedContentSequence
public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
Description copied from class: PDFStreamEngine
Called when a marked content group begins.
- Overrides:
beginMarkedContentSequence
in class PDFStreamEngine
- Parameters:
tag
- indicates the role or significance of the sequence
properties
- optional properties
-
endMarkedContentSequence
public void endMarkedContentSequence()
Description copied from class: PDFStreamEngine
Called when a marked content group ends.
- Overrides:
endMarkedContentSequence
in class PDFStreamEngine
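The begin and end callbacks, together with the currentMarkedContents deque and the markedContents list, suggest a stack-based way of tracking nested marked content groups. The following is a rough sketch of that idea only. MarkedNode and MarkedContentTracker are hypothetical stand-ins, not PDFBox classes, and the behavior shown (top-level groups collected in a list, nested groups attached to the innermost open group) is an assumption about how such tracking could work.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a marked content group: a tag plus nested groups.
final class MarkedNode {
    final String tag;
    final List<MarkedNode> children = new ArrayList<>();

    MarkedNode(String tag) {
        this.tag = tag;
    }
}

// Hypothetical tracker mirroring the begin/end callback pair.
final class MarkedContentTracker {
    private final List<MarkedNode> roots = new ArrayList<>();  // top-level groups
    private final Deque<MarkedNode> open = new ArrayDeque<>(); // currently open groups

    // Called when a group begins: attach it to the innermost open group,
    // or record it as a top-level group, then mark it as open.
    void begin(String tag) {
        MarkedNode node = new MarkedNode(tag);
        if (open.isEmpty()) {
            roots.add(node);
        } else {
            open.peek().children.add(node);
        }
        open.push(node);
    }

    // Called when a group ends: close the innermost open group, if any.
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    List<MarkedNode> getRoots() {
        return roots;
    }

    public static void main(String[] args) {
        MarkedContentTracker tracker = new MarkedContentTracker();
        tracker.begin("P");
        tracker.begin("Span");
        tracker.end();
        tracker.end();
        System.out.println(tracker.getRoots().size()); // prints 1
    }
}
```

Guarding end() against an empty stack keeps the sketch tolerant of content streams with unbalanced end operators.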
-
xobject
public void xobject(PDXObject xobject)
-
processTextPosition
protected void processTextPosition(TextPosition text)
This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
- Overrides:
processTextPosition
in class LegacyPDFStreamEngine
- Parameters:
text
- The text to process.
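One way to "take care of overlapping text" is to drop a glyph when the same character has already been seen at nearly the same position, which is what a map from character string to previously seen positions (compare characterListMapping) would support. The sketch below illustrates that idea under stated assumptions: Glyph and OverlapFilter are hypothetical classes, and the fixed tolerance is an illustrative choice rather than the value this class uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a positioned piece of text.
final class Glyph {
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical filter that suppresses duplicate overlapping glyphs.
final class OverlapFilter {
    private final Map<String, List<Glyph>> seen = new HashMap<>();
    private final List<Glyph> accepted = new ArrayList<>();
    private final float tolerance; // assumed fixed tolerance, in page units

    OverlapFilter(float tolerance) {
        this.tolerance = tolerance;
    }

    // Returns true if the glyph was kept, false if it duplicates an
    // earlier glyph with the same text at nearly the same position.
    boolean offer(Glyph glyph) {
        List<Glyph> sameText = seen.computeIfAbsent(glyph.text, k -> new ArrayList<>());
        for (Glyph other : sameText) {
            boolean closeX = Math.abs(other.x - glyph.x) < tolerance;
            boolean closeY = Math.abs(other.y - glyph.y) < tolerance;
            if (closeX && closeY) {
                return false;
            }
        }
        sameText.add(glyph);
        accepted.add(glyph);
        return true;
    }

    List<Glyph> getAccepted() {
        return accepted;
    }
}
```

Keying the map by character string means a new glyph is compared only against earlier glyphs with the same text, not against every glyph on the page.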
-
getMarkedContents
public java.util.List<PDMarkedContent> getMarkedContents()
-
-