Examining a page
Pages are dictionaries
In PDFs, the main data structure is the dictionary, a key-value data
structure much like a Python dict or attrdict. The major difference is
that the keys can only be names, while values can be any type, including
other dictionaries.
PDF dictionaries are represented as pikepdf.Dictionary, and names
are of type pikepdf.Name. A page is just another dictionary, with a
few required fields that give it special status as a page.
A pikepdf.Name is usually an ASCII-encoded string beginning with
“/” followed by a capital letter.
In [1]: from pikepdf import Pdf
In [2]: example = Pdf.open('../tests/resources/congress.pdf')
In [3]: page1 = example.pages[0]
In [4]: page1
Out[4]:
<pikepdf.Dictionary(type_="/Page")({
  "/Contents": pikepdf.Stream(stream_dict={
      "/Length": 50
    }, data=<...>),
  "/MediaBox": [ 0, 0, 200, 304 ],
  "/Parent": <reference to /Pages>,
  "/Resources": {
    "/XObject": {
      "/Im0": pikepdf.Stream(stream_dict={
          "/BitsPerComponent": 8,
          "/ColorSpace": "/DeviceRGB",
          "/Filter": [ "/DCTDecode" ],
          "/Height": 1520,
          "/Length": 192956,
          "/Subtype": "/Image",
          "/Type": "/XObject",
          "/Width": 1000
        }, data=<...>)
    }
  },
  "/Type": "/Page"
})>
Item and attribute notation
Dictionary values may be looked up using item notation (page1['/MediaBox']) or
attribute notation (page1.MediaBox). The two conventions are equivalent.
In [5]: page1.MediaBox
Out[5]: pikepdf.Array([ 0, 0, 200, 304 ])
In [6]: page1['/MediaBox']