What to do when a client sends a PDF?


PDF files are the Achilles heel of translators.

What’s a PDF, anyway?

The PDF (Portable Document Format) was created by Adobe in the 90s as a way for files to keep their original formatting on any operating system or in any software. In other words, it freezes a document, say a Word file, into a fixed layout that isn’t meant to be edited. A scanned or photographed PDF really is just an image of the text.

This can be useful when you want to send a quote and you don’t want the customer to change the prices. However, it can also be a headache if you need to import it into your CAT Tool.


How do CAT Tools handle PDFs?

Some CAT Tools, like Trados or memoQ, handle PDF files quite well if the quality is good.

Basically, when you import a PDF into Trados, for example, it converts the file into an editable format, using OCR (I’ll get back to OCR) when the text can’t be read directly. You then translate it and export the result, which is usually a Word file rather than a new PDF. Seems simple, but it can be tricky.

The trouble is that PDF files are rarely the pinnacle of quality.

With the evolution of software, almost every file can be converted into a PDF, including that photo you took with your Nokia 6630 15 years ago of a birth certificate that fell in the washing machine.

If you try to open a bad-quality PDF in a CAT tool, you’re in for a treat. The Optical Character Recognition (OCR) software that CAT tools use to “read” the document will not be able to make sense of it. Chances are you will end up with either a document with more tags than you can handle or random characters that you can’t decipher.
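If you’d rather not judge the “random characters” by eye, a few lines of Python can give a rough idea of whether recognised text is usable. This is only a sketch under assumptions: the function name `readable_ratio` and the 0.7 cut-off are made up for illustration, and it expects text you have already copied out of the PDF or the OCR result.

```python
import string

# Quotation marks that are not covered by string.punctuation.
EXTRA_PUNCTUATION = "“”‘’«»"


def readable_ratio(text):
    """Return the share of tokens that look like ordinary words or numbers."""
    tokens = text.split()
    if not tokens:
        return 0.0

    def is_wordlike(token):
        # Ignore punctuation stuck to the start or end of the token.
        core = token.strip(string.punctuation + EXTRA_PUNCTUATION)
        # Allow hyphens and apostrophes inside words such as "client's".
        letters = core.replace("-", "").replace("'", "").replace("’", "")
        return letters.isalpha() or core.isdigit()

    return sum(is_wordlike(t) for t in tokens) / len(tokens)


if __name__ == "__main__":
    sample = "The birth certificate was issued in 1990."
    ratio = readable_ratio(sample)
    # The 0.7 threshold is an arbitrary, hypothetical cut-off.
    verdict = "probably usable" if ratio >= 0.7 else "probably garbled"
    print(f"{ratio:.2f} {verdict}")
```

A low ratio is a hint that manual typing or external OCR will be needed, not a verdict. Always look at the file yourself before quoting.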


So, what to do when a client sends you a PDF?

For starters, open the PDF and pay attention to: the type of text and the subject area (for the quote), the formatting (whether it has many tables, images containing text, balloons, decorations, confetti, etc.) and the quality of the file itself.

Now, if it is just text with no images, no tables and good quality, your CAT Tool will probably manage to do a good job. Still, try opening it in Word first (Word also does its best to open PDFs: right-click the file -> Open with -> choose Word -> pray). If the result is good, it’s always better to import the Word document into your CAT Tool than the PDF itself. You can also use the Word version to count the words, if repetitions are not an issue.
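Word gives you a total but says nothing about repetitions. If you want a ballpark figure before the file ever reaches a CAT Tool, something like the sketch below can help. It assumes you have pasted the text into a Python string, and it splits segments in a very crude way (sentence-ending punctuation and line breaks), so expect the numbers to differ from what Trados or memoQ report. The function names are hypothetical.

```python
import re
from collections import Counter


def split_segments(text):
    """Split text into rough sentence-like segments (a simplification)."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [part.strip() for part in parts if part.strip()]


def count_words(text):
    """Count words, treating "client's" or "right-click" as one word each."""
    return len(re.findall(r"\w+(?:['’-]\w+)*", text))


def repetition_report(text):
    """Return total words, segment count and words inside repeated segments."""
    segments = split_segments(text)
    counts = Counter(segment.lower() for segment in segments)
    # Every occurrence after the first one counts as a repetition.
    repeated_words = sum(
        count_words(segment) * (n - 1) for segment, n in counts.items() if n > 1
    )
    return {
        "total_words": count_words(text),
        "segments": len(segments),
        "repeated_words": repeated_words,
    }


if __name__ == "__main__":
    sample = "Sign here. Date of birth: 1990.\nSign here."
    print(repetition_report(sample))
```

Treat the result as an estimate for the quote only. The analysis from your CAT Tool is still the number to trust once the file imports properly.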

If the PDF’s quality is bad or the formatting is complicated, there are two options: manually typing it into Word (if it is a small PDF of a bad picture) or using external OCR software.

External OCR software like ABBYY, OmniPage, Soda, or even Adobe can make your life much easier. These tools are usually quite intuitive: you only have to tell them which file types you want to work with and which languages the document is in (it is very important to select the correct language because different languages use different characters). This software isn’t free (far from it!). However, there are several free versions (not as good, but good) all over the internet. Some are, of course, better than others, and some steal your email address. I can leave you with a suggestion.

Going back to the process: use external OCR whenever you can. Unless, of course, you’re an adventurer who likes to take risks. So, you import your PDF into an OCR tool, let it read and recognise the text and export the result to Word, and, if the file had complicated formatting, you’ll probably end up with one or two hundred text boxes. If PDF files are the Achilles heel of translators, text boxes are the arrows. Here’s what may happen:

  • Your CAT will read some text boxes and not others, God knows why;
  • Translated text is rarely the same length as the source, so text boxes will be too big or too small in the final document;
  • You will have text boxes inside tables. This gives you grey hair.

To avoid text boxes, export your recognised document as “plain text”. This will remove everything that is not text from the document and create a Word document with no formatting at all, which is a CAT Tool’s favourite meal.
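Plain-text exports often still carry a line break at the end of every printed line and stray runs of spaces. Here is a small sketch of how that could be tidied before formatting. It assumes that paragraphs are separated by at least one blank line in the export, which is not true of every OCR tool, and the function name `tidy_plain_text` is made up for illustration.

```python
import re


def tidy_plain_text(raw):
    """Rejoin lines inside paragraphs and collapse repeated whitespace."""
    # Normalise Windows and old Mac line endings first.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: a blank line (or several) marks a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in paragraphs:
        # Single line breaks, tabs and runs of spaces become one space.
        paragraph = re.sub(r"\s+", " ", paragraph).strip()
        if paragraph:
            cleaned.append(paragraph)
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    raw = "This is a line\nbroken by OCR.   Too   many spaces.\r\n\r\n\r\nSecond\tparagraph."
    print(tidy_plain_text(raw))
```

Note that this also flattens anything that was deliberately laid out with line breaks, such as addresses, so check those parts by hand.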

Then, format it. Formatting can be a tedious process, but if the PDF is complicated you can always charge for it. The basic idea is to take that plain-text Word document and turn it into as close a copy of the original as you can, keeping the formatting as simple as possible to avoid tags and text boxes.

What to avoid in formatting:

  • Different fonts;
  • Shapes;
  • Text boxes;
  • Runs of spaces (use tabs or, preferably, text alignment);
  • Manual line breaks (Shift+Enter);
  • Automatic tables of contents.
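If you want a second opinion before importing, part of the list above can be checked automatically. A .docx file is a zip archive whose main text lives in `word/document.xml`, so the sketch below simply searches that XML for the usual suspects. It is a rough check under assumptions: it only looks at the document body (not headers, footers or styles), it uses plain pattern matching rather than a proper XML parser, and the function names are hypothetical.

```python
import re
import zipfile


def audit_document_xml(xml):
    """Look for formatting that tends to cause tags or import problems."""
    # Fonts applied directly to runs of text (fonts set in styles are not seen).
    fonts = set(re.findall(r'<w:rFonts\b[^>]*\bw:ascii="([^"]+)"', xml))
    # <w:br/> without a page or column type is a manual line break.
    breaks = re.findall(r"<w:br\b[^>]*/?>", xml)
    manual_breaks = [
        b for b in breaks if not re.search(r'w:type="(page|column)"', b)
    ]
    return {
        "fonts": sorted(fonts),
        # Word may store a text box twice (modern version plus fallback),
        # so read this as "zero or not zero" rather than an exact count.
        "text_boxes": len(re.findall(r"<w:txbxContent\b", xml)),
        "manual_line_breaks": len(manual_breaks),
        # Three or more consecutive spaces inside one piece of text.
        "runs_of_spaces": len(
            re.findall(r"<w:t\b[^>]*>[^<]* {3,}[^<]*</w:t>", xml)
        ),
        # An automatic table of contents is stored as a TOC field.
        "has_auto_toc": bool(re.search(r"<w:instrText\b[^>]*>\s*TOC\b", xml)),
    }


def audit_docx(path):
    """Read word/document.xml from a .docx file and audit it."""
    with zipfile.ZipFile(path) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    return audit_document_xml(xml)
```

Calling `audit_docx("my_file.docx")` (the file name is just an example) returns a small dictionary. More than one font, or any text boxes, is a sign that the document needs another formatting pass before it goes into the CAT Tool.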

What to use in formatting:

  • Same font;
  • Same spacing between paragraphs;
  • Page breaks (highly recommended);
  • Normal margins;
  • Same page size (usually A4).

After creating your close copy of the original, feed it into your CAT Tool, translate and export, and you should have no surprises.

Remember: surprises make you lose time and we all know what time is.