The Python gfx module



1.  gfx

All functionality of pdf2swf, the PDF to SWF converter from swftools (http://www.swftools.org), is also exposed by the Python module "gfx". gfx contains a PDF parser (based on xpdf) and a number of rendering backends. In particular, it can extract text from PDF pages, create bitmaps from them, or convert PDF files to SWF. The latter is similar to what the pdf2swf utility offers, but more powerful: you can also create individual SWF files from single pages of a PDF, or mix pages from different PDFs.

1.1  Compiling and installing gfx 

To install gfx, you first need to download and uncompress one of the archives at http://www.swftools.org/download.html. You then have two options: build the module with Python's setup.py, or build it with configure and make. For the former, all that should be required is:

Code listing 1.1

python setup.py build
python setup.py install

This is the preferred way. If the above gives you any trouble or you prefer make, the following will also create the Python module:

Code listing 1.2

./configure
make
# substitute the following path with your correct python installation:
cp lib/python/*.so /usr/lib/python2.4/site-packages/

You can test whether the Python module was properly installed by running:

Code listing 1.3

python -c 'import gfx'

Once the module has been properly installed, you can start to work with it, as described in the next section.
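
The same check can also be made from inside a script, which is useful when gfx is an optional dependency. The following is a minimal sketch; the helper name gfx_available is made up for this example and is not part of gfx.

```python
def gfx_available():
    """Return True if the gfx module can be imported, False otherwise."""
    try:
        import gfx  # imported only to test whether it is installed
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    if gfx_available():
        print("gfx is installed")
    else:
        print("gfx is not installed")
```

A script can call gfx_available() once at startup and fall back to another tool, or exit with a clear message, when it returns False.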

1.2  Reading a PDF file 

Reading PDF files is done using the open() call. Once the document has been opened, you can query the resulting object for some information about the PDF:

Code listing 1.4

#!/usr/bin/python
import gfx

doc = gfx.open("pdf", "document.pdf")

print "Author:", doc.getInfo("author")
print "Subject:", doc.getInfo("subject")
print "PDF Version:", doc.getInfo("version")

Using getInfo(), you can query the following fields:

title, subject, keywords, author, creator, producer, creationdate, moddate, linearized, tagged, encrypted, oktoprint, oktocopy, oktochange, oktoaddnotes, version

Depending on the PDF file, not all these fields may contain useful information.
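
To see which fields a particular file actually fills in, you can query all of them in a loop. The sketch below collects them into a dictionary. The names INFO_FIELDS and collect_info are invented for this example, and it assumes that getInfo() returns an empty value (rather than raising an exception) for fields that carry no information, which has not been verified here.

```python
# Field names as listed above for getInfo().
INFO_FIELDS = [
    "title", "subject", "keywords", "author", "creator", "producer",
    "creationdate", "moddate", "linearized", "tagged", "encrypted",
    "oktoprint", "oktocopy", "oktochange", "oktoaddnotes", "version",
]


def collect_info(doc, fields=INFO_FIELDS):
    """Return a dict mapping field names to doc.getInfo(field).

    Assumption: fields without useful information yield an empty
    value; those are left out of the result.
    """
    info = {}
    for field in fields:
        value = doc.getInfo(field)
        if value:
            info[field] = value
    return info


# Usage (requires the gfx module and a real file):
#   import gfx
#   doc = gfx.open("pdf", "document.pdf")
#   for key, value in sorted(collect_info(doc).items()):
#       print("%s: %s" % (key, value))
```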

Some PDF files may be protected, or even password encrypted. You can recognize protected files by the fact that doc.getInfo("encrypted") returns "yes". If, additionally, doc.getInfo("oktocopy") returns "no", the file has copy protection enabled, which means that the gfx module won't allow you to extract information from it: extraction of pages (see below) will raise an exception.
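
Checking these two fields up front lets a script skip a protected file instead of running into the exception later. A minimal sketch follows; the function names are hypothetical, and only the "encrypted" and "oktocopy" fields with their "yes" and "no" values come from the text above.

```python
def is_encrypted(doc):
    """True if the document reports itself as protected."""
    return doc.getInfo("encrypted") == "yes"


def is_copy_protected(doc):
    """True if the document is protected and copying is not allowed."""
    return is_encrypted(doc) and doc.getInfo("oktocopy") == "no"


# Usage (requires the gfx module and a real file):
#   import gfx
#   doc = gfx.open("pdf", "document.pdf")
#   if is_copy_protected(doc):
#       print("copy protected, page extraction would fail")
```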

If the PDF file is password encrypted, you need the password to display the file. You can pass the password to the open() function by appending it to the filename, using '|' as separator:

Code listing 1.5

#!/usr/bin/python
import gfx

doc = gfx.open("pdf", "protecteddocument.pdf|mysecretpassword")
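
If the password comes from a variable or from user input, the filename argument can be assembled by a small helper. This is only a sketch; with_password is a made-up name, not a gfx function.

```python
def with_password(filename, password=None):
    """Build the filename argument for gfx.open().

    The password, if given, is appended after a '|' separator, as
    described above. Without a password the filename is returned
    unchanged.
    """
    if not password:
        return filename
    return filename + "|" + password


# Usage (requires the gfx module and a real file):
#   import gfx
#   doc = gfx.open("pdf", with_password("protecteddocument.pdf",
#                                       "mysecretpassword"))
```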

1.3  Reading an Image or SWF file 

Reading image files or SWF files is done analogously. You only need to pass a different filetype specifier to the open() function:

Code listing 1.6

#!/usr/bin/python
import gfx

doc1 = gfx.open("image", "myimage.png")
doc2 = gfx.open("swf", "flashfile.swf")

You can use all objects opened with gfx.open() in the same way. In particular, you can extract pages from them (as described in the next section) and render those pages to any kind of output device. (Note that for image files, the number of pages in the document is always 1.)
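
When a script handles mixed input, the filetype specifier can be derived from the file extension. The following sketch shows one way to do that. Only the specifiers "pdf", "image" and "swf" come from the text above; the extension table and the name guess_filetype are assumptions made for this example, and the table would need to be extended for other image formats that your build of gfx can read.

```python
import os

# Hypothetical extension table; extend it as needed.
_FILETYPES = {
    ".pdf": "pdf",
    ".swf": "swf",
    ".png": "image",
}


def guess_filetype(filename):
    """Return the gfx.open() filetype specifier for a filename."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return _FILETYPES[ext]
    except KeyError:
        raise ValueError("don't know how to open %r" % filename)


# Usage (requires the gfx module and a real file):
#   import gfx
#   doc = gfx.open(guess_filetype("myimage.png"), "myimage.png")
```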

1.4  Extracting pages from a (PDF/SWF/Image) file 

Once the document has been properly opened, you can start working with the content, i.e., the individual pages. You can extract a page from a file using the getPage() function. The resulting Page object gives you additional information about the file. getPage() expects the page number, which starts at 1 for the first page.

The following code lists all pages in a file, along with their size:

Code listing 1.7

#!/usr/bin/python
import gfx

doc = gfx.open("pdf", "document.pdf")
for pagenr in range(1,doc.pages+1):
    page = doc.getPage(pagenr)
    print "Page", pagenr, "has dimensions", page.width, "x", page.height

Note: The size of pages can vary in PDF documents. Don't make the common mistake of querying only the first page for its dimensions and using that for all other pages.
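
One way to guard against that mistake is to collect the dimensions of every page first and check whether they agree. The sketch below does this using only doc.pages, getPage(), page.width and page.height from the listing above; the function names are invented for this example.

```python
def page_sizes(doc):
    """Return a list of (pagenr, width, height) for every page."""
    sizes = []
    for pagenr in range(1, doc.pages + 1):
        page = doc.getPage(pagenr)
        sizes.append((pagenr, page.width, page.height))
    return sizes


def has_uniform_size(doc):
    """True if all pages share the same width and height."""
    dimensions = set((width, height) for _, width, height in page_sizes(doc))
    return len(dimensions) <= 1


# Usage (requires the gfx module and a real file):
#   import gfx
#   doc = gfx.open("pdf", "document.pdf")
#   if not has_uniform_size(doc):
#       print("page sizes differ, query each page separately")
```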

1.5  Rendering pages to bitmaps 

The gfx module contains a number of rendering backends. The most interesting is probably the ImageList renderer, which creates images from pages. The following code extracts the first page of a PDF document as an image:

Code listing 1.8

#!/usr/bin/python
import gfx
doc = gfx.open("pdf", "document.pdf")
img = gfx.ImageList()
img.setparameter("antialise", "1") # turn on antialiasing
page1 = doc.getPage(1)
img.startpage(page1.width,page1.height)
page1.render(img)
img.endpage()
img.save("thumbnail80x80.png")

There are a number of pitfalls to be aware of, here:

1.6  Extracting text from PDF files 

Using the PlainText device, you can extract the full text from PDF files. The following code snippet demonstrates this:

Code listing 1.9

#!/usr/bin/python
import gfx
doc = gfx.open("pdf", "document.pdf")
text = gfx.PlainText()
for pagenr in range(1,doc.pages+1):
    page = doc.getPage(pagenr)
    text.startpage(page.width, page.height)
    page.render(text)
    text.endpage()
text.save("document_fulltext.txt")

If you want to extract text from images, or have broken PDF files (i.e., PDF files where the fonts don't correctly reference Unicode characters, and hence text can't be extracted properly the "normal" way), you should use the OCR device instead. In the code above, substitute the call to gfx.PlainText() with the following:

Code listing 1.10

...
gfx.setparameter("zoom", "400")
text = gfx.OCR()
...

As you can see, the OCR device behaves just like any other device. Internally, it generates images from the pages it is asked to process and performs optical character recognition (OCR) on them. Use the "zoom" parameter to scale up the images that OCR operates on for better results (in this example, the images are scaled up by 400%).
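
Because both devices are driven the same way, the choice between them can be reduced to a single flag. The following sketch wraps the two listings above in one function. The function name and its arguments are made up for this example, and the gfx module is passed in as a parameter only so that the sketch can be tried out with stand-in objects.

```python
def extract_text(gfx_module, doc, out_filename, use_ocr=False, zoom="400"):
    """Write the text of all pages of doc to out_filename.

    gfx_module is the imported gfx module. With use_ocr=True, the
    "zoom" parameter is set and the OCR device is used instead of
    PlainText, as in the listings above.
    """
    if use_ocr:
        gfx_module.setparameter("zoom", zoom)
        text = gfx_module.OCR()
    else:
        text = gfx_module.PlainText()
    for pagenr in range(1, doc.pages + 1):
        page = doc.getPage(pagenr)
        text.startpage(page.width, page.height)
        page.render(text)
        text.endpage()
    text.save(out_filename)


# Usage (requires the gfx module and a real file):
#   import gfx
#   doc = gfx.open("pdf", "document.pdf")
#   extract_text(gfx, doc, "document_fulltext.txt", use_ocr=True)
```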

1.7  Rendering pages to SWF files 

As the gfx module derives from pdf2swf, of course it can also convert PDF files to SWF files. The code needed for this is similar to the previous examples:

Code listing 1.11

#!/usr/bin/python
import gfx
doc = gfx.open("pdf", "document.pdf")
swf = gfx.SWF()
for pagenr in range(1,doc.pages+1):
    page = doc.getPage(pagenr)
    swf.startpage(page.width, page.height)
    page.render(swf)
    swf.endpage()
swf.save("document.swf")

With the gfx.SWF device (and with pdf2swf, too), you have a number of options for how the SWF content should be created. It's important that you set all parsing-related parameters before loading the PDF file, as most of the optimization is done during the loading process:

Code listing 1.12

#!/usr/bin/python
import gfx
gfx.setparameter("bitmap", "1") # or "poly2bitmap"
doc = gfx.open("pdf", "document.pdf")
...

The parsing-related parameters are: bitmap, poly2bitmap, bitmapfonts, fonts, fontdir, languagedir.
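
To make sure the ordering is respected, setting the parameters and opening the file can be bundled into one helper. This is a sketch under stated assumptions: the helper name and the PARSING_PARAMETERS constant are invented here, values are converted to strings because the listing above passes strings, and the gfx module is passed in as a parameter so that the sketch can be tried out with stand-in objects.

```python
# Parameter names as listed above.
PARSING_PARAMETERS = ("bitmap", "poly2bitmap", "bitmapfonts",
                      "fonts", "fontdir", "languagedir")


def open_pdf_with_parameters(gfx_module, filename, parameters):
    """Set parsing related parameters, then open the PDF file.

    parameters is a dict such as {"bitmap": "1"}. Unknown keys are
    rejected before anything is set.
    """
    for key in parameters:
        if key not in PARSING_PARAMETERS:
            raise ValueError("not a parsing related parameter: %r" % key)
    for key in sorted(parameters):
        gfx_module.setparameter(key, str(parameters[key]))
    # Only now load the file, so the parameters take effect.
    return gfx_module.open("pdf", filename)


# Usage (requires the gfx module and a real file):
#   import gfx
#   doc = open_pdf_with_parameters(gfx, "document.pdf", {"bitmap": "1"})
```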

1.8  Putting more than one input page on one SWF page 

You don't need to start a new output page for every input page. For example, you can put pairs of pages side by side:

Code listing 1.13

#!/usr/bin/python
import gfx
doc = gfx.open("pdf", "document.pdf")
swf = gfx.SWF()

for pagenr in range(doc.pages//2):
    page1 = doc.getPage(pagenr*2+1)
    page2 = doc.getPage(pagenr*2+2)
    swf.startpage(page1.width+page2.width, max(page1.height,page2.height))
    page1.render(swf,move=(0,0))
    page2.render(swf,move=(page1.width,0), clip=(page1.width,0,page1.width+page2.width,page2.height))
    swf.endpage()

if doc.pages%2:
    # for an odd number of pages, render final page
    page = doc.getPage(doc.pages)
    swf.startpage(page.width,page.height)
    page.render(swf)
    swf.endpage()

swf.save("document.swf")

In this code, we used the move and clip parameters of the render function to shift the second page to the right, and then clip it to its bounding box.
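
The arithmetic behind the listing can be separated from the rendering calls, which makes it easier to check. The following sketch computes the page grouping and the geometry of one double page. Both function names are made up for this example; the formulas are the ones used in the listing above.

```python
def page_groups(num_pages):
    """Group page numbers 1..num_pages into pairs.

    A trailing odd page is returned as a 1-tuple.
    """
    groups = []
    for first in range(1, num_pages, 2):
        groups.append((first, first + 1))
    if num_pages % 2:
        groups.append((num_pages,))
    return groups


def spread_layout(w1, h1, w2, h2):
    """Return the geometry for two pages placed side by side."""
    return {
        # size of the output page passed to startpage()
        "size": (w1 + w2, max(h1, h2)),
        # move= argument for the left and the right page
        "move1": (0, 0),
        "move2": (w1, 0),
        # clip= argument for the right page (its bounding box)
        "clip2": (w1, 0, w1 + w2, h2),
    }
```

For a five page document, page_groups(5) yields the pairs (1, 2) and (3, 4) followed by the single page (5,), which matches the loop and the odd-page branch in the listing.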

1.9  Parsing (PDF/Image/SWF) content yourself 

If none of the supplied output devices (PlainText, ImageList, SWF) is doing what you need, you can also process the PDF content yourself. The gfx module gives you an easy way to do it, by translating the usually very complex PDF file contents into a number of very simple drawing operations. In order to pass those operations to Python, you need the PassThrough output device, together with a custom class:

Code listing 1.14

import gfx
class MyOutput:
    # each method is called on the instance, so it takes "self" first
    def setparameter(self, key, value):
        print "setparameter",key,value
    def startclip(self, outline):
        print "startclip",outline
    def endclip(self):
        print "endclip"
    def stroke(self, outline, width, color, capstyle, jointstyle, miterLimit):
        print "stroke",outline
    def fill(self, outline, color):
        print "fill",outline
    def fillbitmap(self, outline, image, matrix, colortransform):
        print "fillbitmap",outline
    def fillgradient(self, outline, gradient, gradienttype, matrix):
        print "fillgradient",outline
    def addfont(self, font):
        print "addfont"
    def drawchar(self, font, glyph, color, matrix):
        print "drawchar"
    def drawlink(self, outline, url):
        print "drawlink", outline, url

doc = gfx.open("pdf", "document.pdf")
output = gfx.PassThrough(MyOutput())
doc.getPage(1).render(output)

The above is the minimal set of methods that the class passed to "PassThrough" must have in order to process all PDF content. If any of the methods is not defined, an error message will be printed, but the rendering process will not be aborted.
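
As a slightly more useful variation, the custom class can keep state instead of printing. The sketch below counts how often each drawing operation occurs on a page. The class name CountingOutput is invented for this example, the method names and argument lists are taken from the listing above, and each method takes self as its first argument on the assumption that PassThrough invokes them on the instance it is given.

```python
class CountingOutput(object):
    """Counts how often each drawing operation is called."""

    def __init__(self):
        self.counts = {}

    def _count(self, name):
        self.counts[name] = self.counts.get(name, 0) + 1

    def setparameter(self, key, value):
        self._count("setparameter")

    def startclip(self, outline):
        self._count("startclip")

    def endclip(self):
        self._count("endclip")

    def stroke(self, outline, width, color, capstyle, jointstyle, miterLimit):
        self._count("stroke")

    def fill(self, outline, color):
        self._count("fill")

    def fillbitmap(self, outline, image, matrix, colortransform):
        self._count("fillbitmap")

    def fillgradient(self, outline, gradient, gradienttype, matrix):
        self._count("fillgradient")

    def addfont(self, font):
        self._count("addfont")

    def drawchar(self, font, glyph, color, matrix):
        self._count("drawchar")

    def drawlink(self, outline, url):
        self._count("drawlink")


# Usage (requires the gfx module and a real file):
#   import gfx
#   counter = CountingOutput()
#   doc = gfx.open("pdf", "document.pdf")
#   doc.getPage(1).render(gfx.PassThrough(counter))
#   print(counter.counts)
```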