The Python gfx module

1. gfx

All functionality of pdf2swf, the PDF to SWF converter from swftools (http://www.swftools.org), is also exposed by the Python module "gfx". gfx contains a PDF parser (based on xpdf) and a number of rendering backends. In particular, it can extract text from PDF pages, create bitmaps from them, or convert PDF files to SWF. The SWF conversion is similar to what the pdf2swf utility offers, but more powerful: you can also create individual SWF files from single pages of a PDF, or mix pages from different PDFs.

1.1 Compiling gfx and installing

To install gfx, you first need to download and uncompress one of the archives at http://www.swftools.org/download.html. You then basically have two options:

You can build the Python module using setup.py
You can build it "manually" by using make

To do the former, all that should be required is

		Code listing 1.1
		python setup.py build
		python setup.py install

This is the preferred way. If the above gives you any trouble or you prefer make, the following will also create the Python module:

		Code listing 1.2
		./configure
		make
		# substitute the following path with your correct python installation:
		cp lib/python/*.so /usr/lib/python2.4/site-packages/

You can test whether the Python module was properly installed by running

		Code listing 1.3
		python -c 'import gfx'

Once the module has been properly installed, you can start working with it, as described in the next section.
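The same import check can also be done from inside a script, for example when a program should report a missing module instead of crashing. The following is only a minimal sketch: the function name gfx_available is made up for this example and is not part of gfx.

```python
def gfx_available():
    """Return True if the gfx module can be imported, False otherwise."""
    try:
        import gfx  # only tests that the import itself succeeds
    except ImportError:
        return False
    return True
```

A script could call gfx_available() once at startup and point the user to section 1.1 if it returns False.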

1.2 Reading a PDF file

Reading PDF files is done using the open() call. Once the document has been opened, you can query the resulting object for some information about the PDF:

		Code listing 1.4
		#!/usr/bin/python
		import gfx
		doc = gfx.open("pdf", "document.pdf")
		print "Author:", doc.getInfo("author")
		print "Subject:", doc.getInfo("subject")
		print "PDF Version:", doc.getInfo("version")

Using getInfo(), you can query the following fields:

title, subject, keywords, author, creator, producer, creationdate, moddate, linearized, tagged, encrypted, oktoprint, oktocopy, oktochange, oktoaddnotes, version

Depending on the PDF file, not all these fields may contain useful information.
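To see which fields a given file actually fills in, you can loop over the field names listed above. This is a sketch under one assumption that the text does not confirm: that getInfo() returns an empty value for fields without information. The names INFO_FIELDS and collect_info are invented for this example.

```python
# Field names exactly as listed in the text above.
INFO_FIELDS = [
    "title", "subject", "keywords", "author", "creator", "producer",
    "creationdate", "moddate", "linearized", "tagged", "encrypted",
    "oktoprint", "oktocopy", "oktochange", "oktoaddnotes", "version",
]


def collect_info(doc):
    """Return a dict of all getInfo() fields that carry a non-empty value.

    `doc` is an object returned by gfx.open(); only getInfo() is used.
    """
    info = {}
    for field in INFO_FIELDS:
        value = doc.getInfo(field)
        if value:  # assumption: unused fields come back empty
            info[field] = value
    return info
```

With a real document, collect_info(gfx.open("pdf", "document.pdf")) would give you all populated fields in one dictionary.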

Some PDF files may be protected, or even password encrypted. You can recognize protected files by the fact that doc.getInfo("encrypted") returns "yes". If, additionally, doc.getInfo("oktocopy") returns "no", the file has copy protection enabled, which means that the gfx module won't allow you to extract information from it: extracting pages (see below) will raise an exception.
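The two checks described above can be combined into one small helper, so that a script can decide up front whether page extraction is likely to fail. This is a sketch only; protection_status and its three return values are made up for this example, and it relies solely on the "yes"/"no" strings mentioned in the text.

```python
def protection_status(doc):
    """Classify a document opened with gfx.open() by its protection flags.

    Returns "open", "encrypted" or "copy-protected".
    """
    if doc.getInfo("encrypted") != "yes":
        return "open"
    if doc.getInfo("oktocopy") == "no":
        # page extraction is expected to raise an exception for such files
        return "copy-protected"
    return "encrypted"
```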

If the PDF file is password encrypted, you need the password to display the file. You can pass the password to the open() function by appending it to the filename, using '|' as separator:

		Code listing 1.5
		#!/usr/bin/python
		import gfx
		doc = gfx.open("pdf", "protecteddocument.pdf|mysecretpassword")
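If the password comes from a variable rather than a literal, the filename argument can be assembled with a tiny helper. This is a sketch; with_password is a hypothetical name, and the only rule it encodes is the '|' separator described above.

```python
def with_password(filename, password=None):
    """Build the filename argument for gfx.open() for an encrypted PDF.

    Without a password the filename is returned unchanged.
    """
    if password is None:
        return filename
    return filename + "|" + password
```

The result would then be passed on as gfx.open("pdf", with_password("protecteddocument.pdf", "mysecretpassword")).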

1.3 Reading an Image or SWF file

Reading image files or SWF files is done analogously. You only need to pass a different filetype specifier to the open() function:

		Code listing 1.6
		#!/usr/bin/python
		import gfx
		doc1 = gfx.open("image", "myimage.png")
		doc2 = gfx.open("swf", "flashfile.swf")

You can use all objects opened with gfx.open() in the same way. In particular, you can extract pages from them (as described in the next section), and render those pages to any kind of output device. (Note that for image files, the number of pages in the document is always 1.)
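When a script handles mixed input, the filetype specifier can be derived from the file extension. The following is a sketch, not part of gfx: filetype_for and the FILETYPES table are made up for this example, and listing ".png" as the only image extension is an assumption, since the text does not say which image formats are supported.

```python
import os

# Extension -> filetype specifier. Only "pdf", "image" and "swf" appear in
# the text; which image extensions gfx accepts is an assumption, so extend
# this table to match your own build.
FILETYPES = {".pdf": "pdf", ".swf": "swf", ".png": "image"}


def filetype_for(filename):
    """Return the gfx.open() filetype specifier for `filename`."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return FILETYPES[ext]
    except KeyError:
        raise ValueError("no gfx file type known for %r" % filename)
```

A call then looks like gfx.open(filetype_for(name), name).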

1.4 Extracting pages from a (PDF/SWF/Image) file

Once the document has been properly opened, you can start working with the content, i.e., the individual pages. You can extract a page from a file using the getPage() function. The resulting Page object gives you additional information about that page, such as its dimensions. getPage() expects the page number, which starts at 1 for the first page.

The following code lists all pages in a file, along with their size:

		Code listing 1.7
		#!/usr/bin/python
		import gfx
		doc = gfx.open("pdf", "document.pdf")
		for pagenr in range(1,doc.pages+1):
		    page = doc.getPage(pagenr)
		    print "Page", pagenr, "has dimensions", page.width, "x", page.height

Note: The size of pages can vary in PDF documents. Don't make the common mistake of querying only the first page for its dimensions and using that for all other pages.
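One way to guard against this mistake is to collect the dimensions of all pages first and check whether they agree. This is a minimal sketch; page_sizes and has_uniform_size are hypothetical helper names, and they use only doc.pages, getPage(), page.width and page.height as shown above.

```python
def page_sizes(doc):
    """Return a list of (width, height) tuples, one per page, in page order."""
    sizes = []
    for pagenr in range(1, doc.pages + 1):  # page numbers start at 1
        page = doc.getPage(pagenr)
        sizes.append((page.width, page.height))
    return sizes


def has_uniform_size(doc):
    """Return True if every page of the document has the same dimensions."""
    return len(set(page_sizes(doc))) <= 1
```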

1.5 Rendering pages to bitmaps

The gfx module contains a number of rendering backends. The most interesting is probably the ImageList renderer, which creates images from pages. The following code extracts the first page of a PDF document as an image:

Code listing 1.8

#!/usr/bin/python
import gfx
doc = gfx.open("pdf", "document.pdf")
img = gfx.ImageList()
img.setparameter("antialise", "1") # turn on antialiasing
page1 = doc.getPage(1)
img.startpage(page1.width,page1.height)
page1.render(img)
img.endpage()
img.save("thumbnail80x80.png")

There are a number of pitfalls to be aware of here:

The width and height passed to startpage() must be the same as those of the page (page.width, page.height). You can specify smaller or larger sizes, but that causes the page to be clipped or extended, not scaled. If you want to scale the page, you can use the multiply option of the ImageList (which scales the page up by an integer factor) or the zoom option of the PDF parser (which is the same as the DPI, and is 72 by default); the latter allows you to scale the image to any size while keeping the aspect ratio. (You can also use page.getImage() instead of ImageList, which, however, only gives you a raw image string.)
The 'save' function of ImageList will only create PNG files.
If you rendered more than one page, the save() function might create several files, one for each page. If the filename passed to save() is "image.png", then the files will be named "image.1.png", "image.2.png" etc.
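Two of these points lend themselves to small calculations. The sketch below assumes that page.width was measured at the default zoom of 72 and that the rendered size grows linearly with the zoom value; neither is spelled out in the text. Both function names are made up for this example, and numbered_filenames only covers the multi-page case described above.

```python
import os


def zoom_for_width(page_width, target_width, base_zoom=72.0):
    """Zoom value that should render a page at `target_width`.

    Assumes `page_width` was reported at `base_zoom` and that the rendered
    size scales linearly with zoom.
    """
    return base_zoom * target_width / float(page_width)


def numbered_filenames(filename, pagecount):
    """File names save() is described as creating for several pages."""
    base, ext = os.path.splitext(filename)
    return ["%s.%d%s" % (base, n, ext) for n in range(1, pagecount + 1)]
```

The value from zoom_for_width() would be passed as a string, in the same way as the other parameters, e.g. gfx.setparameter("zoom", str(value)).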

1.6 Extracting text from PDF files

Using the PlainText device, you can extract the full text from PDF files. The following code snippet demonstrates this behaviour:

Code listing 1.9

#!/usr/bin/python
import gfx
doc = gfx.open("pdf", "document.pdf")
text = gfx.PlainText()
for pagenr in range(1,doc.pages+1):
    page = doc.getPage(pagenr)
    text.startpage(page.width, page.height)
    page.render(text)
    text.endpage()
text.save("document_fulltext.txt")

If you want to extract text from images, or have broken PDF files (i.e., PDF files where the fonts don't correctly reference Unicode characters, and hence text can't be extracted properly the "normal" way), you should use the OCR device instead. In the code above, substitute the call to gfx.PlainText() with the following:

		Code listing 1.10
		...
		gfx.setparameter("zoom", "400")
		text = gfx.OCR()
		...

As you can see, the OCR device behaves just like any other device. Internally, it will generate images out of the pages that it's asked to process, and perform OCR (Optical character recognition) on them. You should use the "zoom" parameter to scale up the images that OCR operates on, for better results (in this example, the images are scaled up by 400%).
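Because all devices share the same startpage/render/endpage sequence, the page loop can be factored out into one function that works with any of them. This is a sketch; render_all is a hypothetical helper name, and it uses only the calls already shown in the listings above.

```python
def render_all(doc, device):
    """Render every page of `doc` to `device`, one output page per input page.

    `doc` comes from gfx.open(); `device` is any output device such as
    gfx.PlainText(), gfx.OCR() or gfx.ImageList(). Returns the device so
    that save() can be called on the result.
    """
    for pagenr in range(1, doc.pages + 1):  # page numbers start at 1
        page = doc.getPage(pagenr)
        device.startpage(page.width, page.height)
        page.render(device)
        device.endpage()
    return device
```

The text extraction example then shrinks to render_all(doc, gfx.PlainText()).save("document_fulltext.txt").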

1.7 Rendering pages to SWF files

As the gfx module derives from pdf2swf, of course it can also convert PDF files to SWF files. The code needed for this is similar to the previous examples:

Code listing 1.11

#!/usr/bin/python
import gfx
doc = gfx.open("pdf", "document.pdf")
swf = gfx.SWF()
for pagenr in range(1,doc.pages+1):
    page = doc.getPage(pagenr)
    swf.startpage(page.width, page.height)
    page.render(swf)
    swf.endpage()
swf.save("document.swf")

With the gfx.SWF device (and with pdf2swf, too), you have a number of options for how the SWF content should be created:

Render everything in the same way as in the PDF: shapes will be converted to shapes, text to text, and bitmaps to bitmaps. (pdf2swf without any options)
Render as much as possible to bitmaps, but keep text as text and links as links. (pdf2swf -O1) In gfx, this is done by passing the "poly2bitmap" parameter to the module.
Render everything to bitmaps. Only links will be preserved. (pdf2swf -O2) In gfx, this is done by passing the "bitmap" parameter to the module.

It's important that you set all parsing-related parameters before loading the PDF file, as most of the optimization is done during the loading process:

		Code listing 1.12
		#!/usr/bin/python
		import gfx
		gfx.setparameter("bitmap", "1") # or "poly2bitmap"
		doc = gfx.open("pdf", "document.pdf")
		...

Parsing-related parameters are: bitmap, poly2bitmap, bitmapfonts, fonts, fontdir, languagedir.
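To make the required order hard to get wrong, setting the parameter and opening the file can be wrapped in one function. This is a sketch: open_pdf and LEVEL_PARAMETERS are made up for this example, the level numbers mirror the pdf2swf -O1/-O2 options mentioned above, and the gfx module is passed in as an argument so that the snippet does not depend on gfx being installed.

```python
# pdf2swf optimization level -> gfx parameter, as described in the text.
LEVEL_PARAMETERS = {0: None, 1: "poly2bitmap", 2: "bitmap"}


def open_pdf(gfx_module, filename, level=0):
    """Set the rendering mode first, then open the PDF.

    Parameters are set on the module itself, so this sketch does not undo
    a mode that an earlier call has already switched on.
    """
    parameter = LEVEL_PARAMETERS[level]
    if parameter is not None:
        gfx_module.setparameter(parameter, "1")  # must happen before open()
    return gfx_module.open("pdf", filename)
```

With the real module, the call would be doc = open_pdf(gfx, "document.pdf", level=2).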

1.8 Putting more than one input page on one SWF page

You don't need to start a new output page for every input page. For example, you can place pairs of pages side by side:

Code listing 1.13

#!/usr/bin/python
import gfx
doc = gfx.open("pdf", "document.pdf")
swf = gfx.SWF()

for pagenr in range(doc.pages//2):  # integer division: number of full pairs
    page1 = doc.getPage(pagenr*2+1)
    page2 = doc.getPage(pagenr*2+2)
    swf.startpage(page1.width+page2.width, max(page1.height,page2.height))
    page1.render(swf,move=(0,0))
    page2.render(swf,move=(page1.width,0), clip=(page1.width,0,page1.width+page2.width,page2.height))
    swf.endpage()

if doc.pages%2:
    # for an odd number of pages, render final page
    page = doc.getPage(doc.pages)
    swf.startpage(page.width,page.height)
    page.render(swf)
    swf.endpage()

swf.save("document.swf")

In this code, we used the move and clip parameters of the render function to shift the second page to the right, and then clip it to its bounding box.
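The geometry used here can be checked separately from the rendering. The sketch below reproduces the same arithmetic as pure functions; spread_layout and page_pairs are hypothetical names, and the tuple layouts follow the move=(x, y) and clip=(x1, y1, x2, y2) arguments used in the listing above.

```python
def spread_layout(w1, h1, w2, h2):
    """Geometry for placing page 2 to the right of page 1.

    Returns (output_size, move2, clip2): the (width, height) for
    startpage(), the offset of page 2, and the bounding box of page 2.
    """
    output_size = (w1 + w2, max(h1, h2))
    move2 = (w1, 0)
    clip2 = (w1, 0, w1 + w2, h2)
    return output_size, move2, clip2


def page_pairs(pagecount):
    """Split page numbers 1..pagecount into pairs plus an optional last page."""
    pairs = [(n, n + 1) for n in range(1, pagecount, 2)]
    leftover = pagecount if pagecount % 2 else None
    return pairs, leftover
```

For a five-page document, page_pairs(5) yields the pairs (1, 2) and (3, 4), with page 5 left over to be rendered on its own output page.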

1.9 Parsing (PDF/Image/SWF) content yourself

If none of the supplied output devices (PlainText, ImageList, SWF) is doing what you need, you can also process the PDF content yourself. The gfx module gives you an easy way to do it, by translating the usually very complex PDF file contents into a number of very simple drawing operations. In order to pass those operations to Python, you need the PassThrough output device, together with a custom class:

Code listing 1.14

import gfx
class MyOutput:
    # Each callback is an instance method, so it takes self as first argument.
    def setparameter(self, key, value):
        print "setparameter",key,value
    def startclip(self, outline):
        print "startclip",outline
    def endclip(self):
        print "endclip"
    def stroke(self, outline, width, color, capstyle, jointstyle, miterLimit):
        print "stroke",outline
    def fill(self, outline, color):
        print "fill",outline
    def fillbitmap(self, outline, image, matrix, colortransform):
        print "fillbitmap",outline
    def fillgradient(self, outline, gradient, gradienttype, matrix):
        print "fillgradient",outline
    def addfont(self, font):
        print "addfont"
    def drawchar(self, font, glyph, color, matrix):
        print "drawchar"
    def drawlink(self, outline, url):
        print "drawlink", outline, url

doc = gfx.open("pdf", "document.pdf")
output = gfx.PassThrough(MyOutput())
doc.getPage(1).render(output)

The above is the minimum set of functions that the class passed to "PassThrough" must have in order to be able to process all PDF content. If any of the functions is not defined, an error message will be printed, but the rendering process will not be aborted.
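As an example of a custom class that does something useful, the following sketch collects the link targets of a page and ignores all other drawing operations. LinkCollector is a made-up name; the callbacks are written as ordinary instance methods with the argument lists from the listing above, and it is an assumption that the url argument arrives as a plain string.

```python
class LinkCollector(object):
    """Output class for gfx.PassThrough that only records link URLs."""

    def __init__(self):
        self.links = []

    # Drawing operations that are not needed here are accepted and ignored,
    # so that no "function not defined" error message is printed.
    def setparameter(self, key, value):
        pass

    def startclip(self, outline):
        pass

    def endclip(self):
        pass

    def stroke(self, outline, width, color, capstyle, jointstyle, miterLimit):
        pass

    def fill(self, outline, color):
        pass

    def fillbitmap(self, outline, image, matrix, colortransform):
        pass

    def fillgradient(self, outline, gradient, gradienttype, matrix):
        pass

    def addfont(self, font):
        pass

    def drawchar(self, font, glyph, color, matrix):
        pass

    def drawlink(self, outline, url):
        # assumption: `url` is a plain string
        self.links.append(url)
```

It would be used like MyOutput above: create collector = LinkCollector(), render a page to gfx.PassThrough(collector), and then read collector.links.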
