TextPage

This class represents text and images shown on a document page. All MuPDF document types are supported.

The usual ways to create a textpage are DisplayList.get_textpage() and Page.get_textpage(). Because this class has only a few methods, the Page class offers wrappers that create an intermediate text page and then invoke one of the methods below. The last column of the table shows the corresponding Page methods.

For a description of what this class is all about, see Appendix 2.

Method            Description                                              Page.get_text() option or search method
extractText()     extract plain text                                       “text”
extractTEXT()     synonym of previous                                      “text”
extractBLOCKS()   plain text grouped in blocks                             “blocks”
extractWORDS()    all words with their bbox                                “words”
extractHTML()     page content in HTML format                              “html”
extractXHTML()    page content in XHTML format                             “xhtml”
extractXML()      page text in XML format                                  “xml”
extractDICT()     page content in dict format                              “dict”
extractJSON()     page content in JSON format                              “json”
extractRAWDICT()  page content in dict format, with character detail       “rawdict”
extractRAWJSON()  page content in JSON format, with character detail       “rawjson”
search()          search for a string in the page                          Page.search_for()

Class API

class TextPage
extractText()
extractTEXT()

Return a string of the page’s complete text. The text is UTF-8 unicode and in the same sequence as specified at the time of document creation.

Return type

str

extractBLOCKS()

Textpage content as a list of text lines grouped by block. Each list item looks like this:

(x0, y0, x1, y1, "lines in blocks", block_no, block_type)

The first four entries are the block’s bbox coordinates. block_type is 1 for an image block and 0 for a text block; block_no is the block sequence number.

For an image block, its bbox and a text line with image meta information are included – not the image data itself.

This is a high-speed method with just enough information to output plain text in desired reading sequence.

Return type

list
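As a minimal sketch of how the block tuples can be turned into plain text, the following works on a hand-made list shaped like the items described above. The sample data, the placeholder image text and the helper name blocks_to_text are assumptions for illustration; sorting by the top-left corner is only one simple notion of reading order.

```python
# Hypothetical sample shaped like extractBLOCKS() items:
# (x0, y0, x1, y1, "lines in block", block_no, block_type)
sample_blocks = [
    (72.0, 300.0, 500.0, 330.0, "Second paragraph.\n", 1, 0),
    (72.0, 100.0, 500.0, 130.0, "First paragraph.\n", 0, 0),
    (72.0, 150.0, 300.0, 280.0, "<image meta information>", 2, 1),
]


def blocks_to_text(blocks):
    """Join text blocks top-to-bottom, then left-to-right; skip image blocks."""
    text_blocks = [b for b in blocks if b[6] == 0]  # block_type 0 = text
    # Sort by the top-left corner of the bbox: y0 first, then x0.
    text_blocks.sort(key=lambda b: (b[1], b[0]))
    return "".join(b[4] for b in text_blocks)


print(blocks_to_text(sample_blocks))
```

Multi-column layouts would need a more careful sort key than this one.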

extractWORDS()

Textpage content as a list of single words with bbox information. An item of this list looks like this:

(x0, y0, x1, y1, "word", block_no, line_no, word_no)

Everything wrapped in spaces is treated as a “word” with this method.

This is a high-speed method which e.g. allows extracting text from within a given rectangle.

Return type

list
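Extracting text from within a given rectangle could look roughly like the sketch below. It operates on a hand-made list shaped like the word tuples above; the sample data and the helper name words_in_rect are hypothetical, and "inside" is taken to mean that the word bbox is fully contained in the rectangle.

```python
# Hypothetical sample shaped like extractWORDS() items:
# (x0, y0, x1, y1, "word", block_no, line_no, word_no)
sample_words = [
    (104.0, 100.0, 140.0, 112.0, "world", 0, 0, 1),
    (72.0, 100.0, 100.0, 112.0, "Hello", 0, 0, 0),
    (72.0, 400.0, 110.0, 412.0, "Footer", 1, 0, 0),
]


def words_in_rect(words, rect):
    """Return the words whose bbox lies fully inside rect = (x0, y0, x1, y1)."""
    rx0, ry0, rx1, ry1 = rect
    inside = [
        w for w in words
        if w[0] >= rx0 and w[1] >= ry0 and w[2] <= rx1 and w[3] <= ry1
    ]
    # Restore the sequence given by block, line and word numbers.
    inside.sort(key=lambda w: (w[5], w[6], w[7]))
    return " ".join(w[4] for w in inside)


print(words_in_rect(sample_words, (50, 90, 200, 120)))
```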

extractHTML()

Textpage content in HTML format. This version contains complete formatting and positioning information. Images are included (encoded as base64 strings). You need an HTML package to interpret the output in Python. Your internet browser should be able to adequately display this information, but see Controlling Quality of HTML Output.

Return type

str

extractDICT()

Textpage content as a Python dictionary. Provides the same level of detail as the HTML output. See below for the structure.

Return type

dict

extractJSON()

Textpage content in JSON format. Created by json.dumps(TextPage.extractDICT()). It is included for backward compatibility. You will probably only ever use this method to write the result to a file. The method detects binary image data and converts it to base64-encoded strings in the JSON output.

Return type

str
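Plain json.dumps() cannot serialize bytes, which is why binary image data needs special handling. The sketch below shows one way to get a comparable result with the standard library only; it is not the library's own implementation, and the helper name encode_binary and the shortened sample dictionary are assumptions.

```python
import base64
import json


def encode_binary(obj):
    """Fallback for json.dumps: turn bytes-like values into base64 text."""
    if isinstance(obj, (bytes, bytearray)):
        return base64.b64encode(obj).decode("ascii")
    raise TypeError(f"cannot serialize {type(obj).__name__}")


# Hypothetical, heavily shortened page dictionary with one image block.
page_dict = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [{"type": 1, "bbox": (0.0, 0.0, 10.0, 10.0), "image": b"\x89PNG"}],
}

# json.dumps calls encode_binary only for values it cannot handle itself.
as_json = json.dumps(page_dict, default=encode_binary)
print(as_json)
```

Note that tuples such as bbox become JSON arrays, so they come back as lists after json.loads().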

extractXHTML()

Textpage content in XHTML format. Text information detail is comparable with extractTEXT(), but also contains images (base64 encoded). This method makes no attempt to re-create the original visual appearance.

Return type

str

extractXML()

Textpage content in XML format. This contains complete formatting information about every single character on the page: font, size, line, paragraph, location, color, etc. Contains no images. You probably need an XML package to interpret the output in Python.

Return type

str

extractRAWDICT()

Textpage content as a Python dictionary – technically similar to extractDICT(), and it contains that information as a subset (including any images). It provides additional detail down to each character, which makes using XML obsolete in many cases. See below for the structure.

Return type

dict

extractRAWJSON()

Textpage content in JSON format. Created by json.dumps(TextPage.extractRAWDICT()). You will probably only ever use this method to write the result to a file. The method detects binary image data and converts it to base64-encoded strings in the JSON output.

Return type

str

search(needle, quads=False)

(Changed in v1.18.2)

Search for a string and return a list of the locations found.

Parameters
  • needle (str) – the string to search for. The search is case-insensitive.

  • quads (bool) – return quadrilaterals instead of rectangles.

Return type

list

Returns

a list of Rect or Quad objects, each surrounding a found needle occurrence. Because the search string may contain spaces, its parts may be located on different lines. In this case, more than one rectangle (or quadrilateral) is returned. (Changed in v1.18.2) The method now supports dehyphenation, so it will find “method” even if it was hyphenated into the two parts “meth-” and “od” across two lines. The two returned rectangles will exclude the hyphen in this case.

Note

Overview of changes in v1.18.2:

  1. The hit_max parameter has been removed: all hits are always returned.

  2. The rect parameter of the TextPage is now respected: only text inside this area is examined. Only characters with fully contained bboxes are considered. The wrapper method Page.search_for() correspondingly supports a clip parameter.

  3. Words hyphenated at the end of a line are now found.

  4. Overlapping rectangles in the same line are now automatically joined. We assume that such separations are an artifact created by multiple marked content groups, containing parts of the same search needle.

Example Quad versus Rect: when searching for needle “pymupdf”, then the corresponding entry will either be the blue rectangle, or, if quads was specified, the quad Quad(ul, ur, ll, lr).
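To illustrate how the two result types relate: the rectangle of a hit is the smallest axis-parallel rectangle that contains the quad, so both cover the same area only for non-rotated text. The sketch below uses plain tuples instead of Quad and Rect objects; the helper name quad_to_rect and the coordinates are hypothetical.

```python
def quad_to_rect(quad):
    """Smallest axis-parallel rectangle (x0, y0, x1, y1) enclosing a quad.

    quad is a sequence of four (x, y) points in the order ul, ur, ll, lr.
    """
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical quad around slightly rotated text.
quad = ((100.0, 110.0), (160.0, 100.0), (102.0, 122.0), (162.0, 112.0))
print(quad_to_rect(quad))
```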

_images/img-quads.jpg

rect

The rectangle associated with the text page. This either equals the rectangle of the creating page or the clip parameter of Page.get_textpage() and the text extraction / searching methods.

Note

The output of text searching and most text extractions is restricted to this rectangle. (X)HTML and XML output will however always extract the full page.

Dictionary Structure of extractDICT() and extractRAWDICT()

_images/img-textpage.png

Page Dictionary

Key      Value
width    width of the clip rectangle (float)
height   height of the clip rectangle (float)
blocks   list of block dictionaries

Block Dictionaries

Block dictionaries come in two different formats for image blocks and for text blocks.

(Changed in v1.18.0) – new dict key number, the block number.

Image block:

Key         Value
type        1 = image (int)
bbox        block / image rectangle, formatted as tuple(fitz.Rect)
number      block number (int) (0-based)
ext         image type (str), as file extension, see below
width       original image width (int)
height      original image height (int)
colorspace  colorspace.n (int)
xres        resolution in x-direction (int)
yres        resolution in y-direction (int)
bpc         bits per component (int)
image       image content (bytes or bytearray)

Possible values of the “ext” key are “bmp”, “gif”, “jpeg”, “jpx” (JPEG 2000), “jxr” (JPEG XR), “png”, “pnm”, and “tiff”.

Note

  1. In some error situations, all of the above values may be zero or empty. So, please be prepared to digest items like:

    {"type": 1, "bbox": (0.0, 0.0, 0.0, 0.0), ..., "image": b""}
    
  2. TextPage and corresponding method Page.get_text() are available for all document types. Only for PDF documents, methods Document.get_page_images() / Page.get_images() offer some overlapping functionality as far as image lists are concerned. But both lists may or may not contain the same items. Any differences are most probably caused by one of the following:

    • “Inline” images (see page 352 of the Adobe PDF References) of a PDF page are contained in a textpage, but not in Page.get_images().

    • Image blocks in a textpage are generated for every image location – whether or not there are any duplicates. This is in contrast to Page.get_images(), which will contain each image only once.

    • Images mentioned in the page’s object definition will always appear in Page.get_images() (see footnote 1). However, there may be no “display” command for an image in the page’s contents (erroneously or on purpose). In this case the image will not appear in the textpage.
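The two points about empty image blocks and repeated image locations can be handled along the lines of the following sketch. It works on a hand-made dictionary shaped like the image blocks above; the helper name unique_images, the file naming scheme and the use of a SHA-256 digest to spot duplicates are assumptions, not part of the library.

```python
import hashlib


def unique_images(page_dict):
    """Yield (filename, data) once per distinct image in a page dictionary.

    Empty image blocks (error situations) are skipped; repeated placements
    of the same image are detected via a digest of the image bytes.
    """
    seen = set()
    for block in page_dict.get("blocks", []):
        if block.get("type") != 1:        # keep image blocks only
            continue
        data = bytes(block.get("image", b""))
        if not data:                      # empty block from an error situation
            continue
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:                # same image displayed again
            continue
        seen.add(digest)
        yield f"img-{block['number']}.{block['ext']}", data


# Hypothetical page dictionary: one image shown twice, plus an empty block.
sample_page = {
    "blocks": [
        {"type": 1, "number": 0, "ext": "png", "image": b"abc"},
        {"type": 0, "number": 1, "lines": []},
        {"type": 1, "number": 2, "ext": "png", "image": b"abc"},
        {"type": 1, "number": 3, "ext": "", "image": b""},
    ]
}
names = [name for name, _ in unique_images(sample_page)]
print(names)
```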

Text block:

Key      Value
type     0 = text (int)
bbox     block rectangle, formatted as tuple(fitz.Rect)
number   block number (int) (0-based)
lines    list of text line dictionaries

Line Dictionary

Key      Value
bbox     line rectangle, formatted as tuple(fitz.Rect)
wmode    writing mode (int): 0 = horizontal, 1 = vertical
dir      writing direction (list of floats): [x, y]
spans    list of span dictionaries

The value of key “dir” is a unit vector and should be interpreted as follows:

  • x: positive = “left-right”, negative = “right-left”, 0 = neither

  • y: positive = “top-bottom”, negative = “bottom-top”, 0 = neither

The values indicate the “relative writing speed” in each direction, such that x² + y² = 1. In other words, dir = [cos(beta), sin(beta)], where beta is the writing angle relative to the x-axis.
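Given dir = [cos(beta), sin(beta)], the angle beta can be recovered with atan2. The helper name writing_angle below is hypothetical. Because a positive y means “top-bottom”, an angle of 90 degrees corresponds to text running down the page.

```python
import math


def writing_angle(direction):
    """Return the writing angle beta in degrees for a "dir" value [x, y]."""
    x, y = direction
    return math.degrees(math.atan2(y, x))


print(writing_angle([1.0, 0.0]))   # left-to-right text
print(writing_angle([0.0, 1.0]))   # text running top-to-bottom
```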

Span Dictionary

Spans contain the actual text. A line contains more than one span only if it contains text with different font properties.

(Changed in version 1.14.17) Spans now also have a bbox key (again). (Changed in version 1.17.6) Spans now also have an origin key.

Key        Value
bbox       span rectangle, formatted as tuple(fitz.Rect)
origin     tuple coordinates of the first character’s bottom left point
font       font name (str)
ascender   ascender of the font (float)
descender  descender of the font (float)
size       font size (float)
flags      font characteristics (int)
color      text color in sRGB format (int)
text       (only for extractDICT()) text (str)
chars      (only for extractRAWDICT()) list of character dictionaries

(New in version 1.16.0): “color” is the text color encoded as an sRGB integer, e.g. 0xFF0000 for red. Two functions convert this integer back to a tuple: sRGB_to_pdf() returns (r, g, b) with float values from 0 to 1 (the PDF format), and sRGB_to_rgb() returns (R, G, B) with integer values from 0 to 255.
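For readers who want to see the encoding itself, here is a pure-Python sketch of the same conversions. The lower-case names srgb_to_rgb and srgb_to_pdf are local, hypothetical helpers chosen to avoid confusion with the library functions; they only demonstrate the bit layout of the integer.

```python
def srgb_to_rgb(srgb):
    """Split an sRGB integer such as 0xFF0000 into (R, G, B), each 0..255."""
    return ((srgb >> 16) & 0xFF, (srgb >> 8) & 0xFF, srgb & 0xFF)


def srgb_to_pdf(srgb):
    """Same color as PDF-style floats (r, g, b), each in the range 0..1."""
    return tuple(c / 255.0 for c in srgb_to_rgb(srgb))


print(srgb_to_rgb(0xFF0000))
print(srgb_to_pdf(0xFF0000))
```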

(New in v1.18.5): “ascender” and “descender” are font properties, provided relative to fontsize 1. Note that descender is a negative value. The following picture shows the relationship to other values and properties.

_images/img-asc-desc.png

These numbers may be used to compute the minimum height of a character (or span) – as opposed to the standard height provided in the “bbox” values (which actually represents the line height). The following code recalculates the span bbox to have a height of fontsize exactly fitting the text inside:

>>> a = span["ascender"]
>>> d = span["descender"]
>>> r = fitz.Rect(span["bbox"])
>>> o = fitz.Point(span["origin"])  # its y-value is the baseline
>>> r.y1 = o.y - span["size"] * d / (a - d)
>>> r.y0 = r.y1 - span["size"]
>>> # r now is a rectangle of height 'fontsize'

Caution

The above calculation may deliver a larger height than the original bbox! This may e.g. happen for OCR-ed documents, where the risk of all sorts of text artifacts is high. MuPDF tries to come up with a reasonable bbox height, independently from the fontsize found in the PDF. So please ensure that the height of span["bbox"] is larger than span["size"] before using the recomputed rectangle.
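A guarded version of the recalculation might look like the sketch below. It uses plain tuples instead of fitz.Rect so that it runs without the library, assumes horizontal text, and falls back to the original bbox whenever that bbox is not taller than the font size. The helper name tight_span_bbox and the sample span values are hypothetical.

```python
def tight_span_bbox(span):
    """Return (x0, y0, x1, y1) with height span["size"], or the original
    bbox if that bbox is not taller than the font size."""
    x0, y0, x1, y1 = span["bbox"]
    size = span["size"]
    if y1 - y0 <= size:            # recomputing would not shrink the bbox
        return (x0, y0, x1, y1)
    a, d = span["ascender"], span["descender"]
    baseline = span["origin"][1]   # y-value of the origin is the baseline
    new_y1 = baseline - size * d / (a - d)
    return (x0, new_y1 - size, x1, new_y1)


# Hypothetical span values for illustration only.
span = {
    "bbox": (72.0, 100.0, 200.0, 116.0),
    "origin": (72.0, 112.0),
    "size": 12.0,
    "ascender": 0.75,
    "descender": -0.25,
}
print(tight_span_bbox(span))
```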

Note

You may request PyMuPDF to do all of the above automatically by executing fitz.TOOLS.set_small_glyph_heights(True). This sets a global parameter so that all subsequent text searches and text extractions are based on reduced glyph heights, where meaningful.

The following shows the original span rectangle in red and the rectangle with re-computed height in blue.

_images/img-span-rect.png

“flags” is an integer, interpreted as a bit field like this:

  • bit 0: superscripted (2**0)

  • bit 1: italic (2**1)

  • bit 2: serifed (2**2)

  • bit 3: monospaced (2**3)

  • bit 4: bold (2**4)
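To list every characteristic of a span at once, the bit field can be decoded with a small lookup table, as in this sketch. The names FLAG_NAMES and decode_flags are hypothetical; the bit assignments are the ones listed above.

```python
# Bit value -> characteristic, following the list above.
FLAG_NAMES = {
    2**0: "superscripted",
    2**1: "italic",
    2**2: "serifed",
    2**3: "monospaced",
    2**4: "bold",
}


def decode_flags(flags):
    """Return the names of all characteristics set in a span's "flags"."""
    return [name for bit, name in FLAG_NAMES.items() if flags & bit]


print(decode_flags(2**1 | 2**4))
```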

Test these characteristics like so:

>>> if flags & 2**1: print("italic")
>>> # etc.

Character Dictionary for extractRAWDICT()

We are currently providing the bbox in rect_like format. In a future version, we might change that to quad_like. The image “textpagechar” shows the relationship between the items in the following table.

Key      Value
origin   tuple coordinates of the character’s bottom left point
bbox     character rectangle, formatted as tuple(fitz.Rect)
c        the character (unicode)
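Putting the dictionary levels together, the text of each line can be rebuilt by walking blocks, lines, spans and, for the raw variant, characters. The sketch below accepts dictionaries shaped like either extractDICT() or extractRAWDICT() output; the helper names span_text and page_lines and the shortened sample are hypothetical.

```python
def span_text(span):
    """Text of a span from either the dict or the rawdict layout."""
    if "text" in span:                        # extractDICT() layout
        return span["text"]
    # extractRAWDICT() layout: concatenate the "c" value of every character.
    return "".join(ch["c"] for ch in span.get("chars", []))


def page_lines(page_dict):
    """Yield the text of every line in every text block."""
    for block in page_dict["blocks"]:
        if block["type"] != 0:                # skip image blocks
            continue
        for line in block["lines"]:
            yield "".join(span_text(s) for s in line["spans"])


# Hypothetical, heavily shortened raw dictionary (bbox, font, ... omitted).
raw_page = {
    "blocks": [
        {"type": 1, "image": b""},
        {"type": 0, "lines": [
            {"spans": [
                {"chars": [{"c": "H"}, {"c": "i"}]},
                {"chars": [{"c": "!"}]},
            ]},
        ]},
    ]
}
print(list(page_lines(raw_page)))
```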

Footnotes

1

Image specifications for a PDF page are done in a page’s (sub-) dictionary, called “/Resources”. Resource dictionaries can be inherited from the page’s parent object (usually the catalog). The PDF creator may e.g. define one /Resources on file level, naming all images and all fonts ever used by any page. In this case, Page.get_images() and Page.get_fonts() will always return the same lists for all pages.