Why Text Is The Hardest Part Of Editing PDFs
Unlike Word documents, PDFs don't store text in 'sentences' or 'paragraphs.' Explore the underlying architecture of the PDF format to understand why text manipulation is a complex game of glyphs, coordinates, and embedded fonts.
Background
Displaying text is only one of many things a PDF can do, yet it is the part that causes the most trouble for editors. PDF (Portable Document Format) was originally designed as a "digital paper" standard: a way to ensure that a document looks exactly the same on any screen or printer. To achieve this rigid consistency, the format treats text very differently from a word processor like Microsoft Word or Google Docs. As a result, something we take for granted turns out to be deceptively difficult to handle once you look under the hood.
In this article, we'll look at the mechanics behind PDF text and at what makes editing it notoriously difficult compared to other file formats. The discussion is semi-technical, but you don't need to be a programmer to follow the main points.
- PDF text is stored as a series of precise positioning commands, not as editable characters in a continuous flow.
- Mechanics such as the graphics state machine, affine transformations, embedded fonts, and complex scripts make handling text difficult.
- Most professional PDF libraries are commercial because text editing requires handling dozens of edge cases correctly.
- Text is a lot like dates: simple on the surface, but complex once you have to handle it in code.
The Mechanics
PDFs are designed to be a fixed, final format for sharing documents. They are not meant to be easily editable like a Word document or Google Doc. The PDF specification itself is complex and was created with the primary goal of preserving the visual layout and fidelity of documents across different platforms and devices. This focus on visual consistency has made text editing a secondary concern in the design of PDFs, which is why it is often the most difficult aspect to work with when trying to edit a PDF file.
Graphics State Machine
A PDF page is drawn by a sequence of instructions (operators) that the rendering engine executes one after another, each one reading or changing the current graphics state. In that sense, the renderer behaves like a state machine.
A typical text sequence follows a procedural flow of operators that goes something like this: BT (Begin Text) > Tf (Set Font and Size) > Tm (Set Text Matrix) > Tj (Show Text) > ET (End Text).
Suppose the instructions are: go to position (100, 200), set the font to "Helvetica" at size 12, show the text "Hello World", then move the cursor by 50 units.
Repeat this three times, then end the text block. Next, go to position (230, 310), apply a different font and font size, move the cursor by 20 units, set the character spacing to 10 units, show the text "This is a PDF", and end the text block again.
This state machine is not exclusive to text. It also drives other content such as images and vector graphics, with their own instructions like clip path, fill, and stroke. Can you see how this gets complicated very quickly? Each command depends on the state left by the previous one, so the commands must be executed in the correct order to produce the desired visual output. A single misplaced or missing command can cascade through everything that follows, causing text to render incorrectly or not at all. That makes editing a delicate process that requires a solid understanding of how the PDF rendering engine works.
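To make the "state left by the previous command" idea concrete, here is a minimal Python sketch of a toy interpreter for a handful of text operators. It is an illustration under simplifying assumptions, not a real PDF parser: the `TextState` class, the `run` function, and the tuple-based operator list are all hypothetical, positions are plain x/y offsets instead of full matrices, and the cursor advance that a real Tj causes is ignored.

```python
from dataclasses import dataclass


@dataclass
class TextState:
    """Hypothetical, heavily simplified subset of the PDF text state."""
    font: str = ""
    size: float = 0.0
    char_spacing: float = 0.0
    x: float = 0.0
    y: float = 0.0
    in_text: bool = False


def run(ops):
    """Interpret (operator, operands) pairs and return what would be drawn."""
    state = TextState()
    drawn = []
    for op, args in ops:
        if op == "BT":
            state.in_text = True
            state.x, state.y = 0.0, 0.0      # position resets at every BT
        elif op == "ET":
            state.in_text = False
        elif op == "Tf":
            state.font, state.size = args    # stays set until changed
        elif op == "Tc":
            (state.char_spacing,) = args     # stays set until changed
        elif op in ("Td", "Tj"):
            if not state.in_text:
                raise ValueError(f"{op} used outside a BT/ET block")
            if op == "Td":
                dx, dy = args
                state.x += dx
                state.y += dy
            else:
                (text,) = args
                drawn.append((text, state.x, state.y,
                              state.font, state.size, state.char_spacing))
        else:
            raise ValueError(f"unsupported operator {op}")
    return drawn


ops = [
    ("BT", ()),
    ("Tf", ("Helvetica", 12)),
    ("Td", (100, 200)),
    ("Tj", ("Hello World",)),
    ("Td", (50, 0)),
    ("Tj", ("Hello World",)),
    ("ET", ()),
    ("BT", ()),
    ("Tf", ("Times-Roman", 10)),
    ("Td", (230, 310)),
    ("Td", (20, 0)),
    ("Tc", (10,)),
    ("Tj", ("This is a PDF",)),
    ("ET", ()),
]

drawn = run(ops)
for item in drawn:
    print(item)
```

If you delete the second Tf entry from `ops`, the last string is drawn in Helvetica at size 12, because the font set in the first block is still in effect. Nothing in the second block says so explicitly, which is exactly why editing one command can silently change text far away from it.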
Affine Transformations
PDFs use affine transformations to position and orient text on the page.
In the PDF specification, the array [a b c d e f] is mapped to a 3x3 matrix as follows:
| a b 0 |
| c d 0 |
| e f 1 |
In this system, e and f handle the translation (the actual X and Y position). For unrotated text, a and d are the horizontal and vertical scale factors, and b and c are zero. Rotation and skew involve b and c as well: a rotation by an angle θ, for example, sets a = cos θ, b = sin θ, c = -sin θ, and d = cos θ.
This mathematical structure means text can be rotated, scaled, skewed, or translated in any direction.
When editing a PDF, you cannot simply "type" new characters; you must take these transformations into account so that the new text is placed correctly and aligns with the original's visual orientation.
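As a rough illustration of the arithmetic, the sketch below applies a six-number matrix [a b c d e f] to a point and combines two matrices, using the row-vector convention implied by the 3x3 layout above. The helper names (`apply`, `multiply`, `rotation`, `translation`) are made up for this example and do not come from any PDF library.

```python
import math


def apply(m, x, y):
    """Transform the point (x, y) by the matrix [a b c d e f]."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def multiply(m1, m2):
    """Return the single matrix equivalent to applying m1 first, then m2."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def rotation(degrees):
    """Counterclockwise rotation about the origin."""
    t = math.radians(degrees)
    return (math.cos(t), math.sin(t), -math.sin(t), math.cos(t), 0.0, 0.0)


def translation(tx, ty):
    return (1.0, 0.0, 0.0, 1.0, tx, ty)


# Rotate by 90 degrees, then move the result to (100, 200).
combined = multiply(rotation(90), translation(100, 200))

# A point 10 units along the text baseline ends up 10 units *above*
# the origin of the rotated text (up to floating-point rounding).
print(apply(combined, 10, 0))
```

An editor that inserts a new character has to run this kind of calculation in reverse as well: it needs to work out where "10 units further along the baseline" is on the page when the baseline itself has been rotated or scaled.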
Embedded Fonts
PDFs often embed fonts to ensure that the document looks the same on all devices. More often than not, the embedded font is a subset of the full font, containing only the characters used in the document. For example, if the only text set in "Open Sans" is the font's own name, the PDF may embed just the glyphs for "O", "p", "e", "n", "S", "a", and "s". Adding a character that is not in the subset (like "b") then requires either substituting a glyph from a different font or embedding a new font that includes the missing character. If the embedded font is missing or not properly referenced, text editing becomes difficult or even impossible until the font issue is resolved.
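The check an editor has to make before accepting new text can be sketched in a few lines of Python. This is a simplification: a real subset is keyed by glyph IDs and encoding tables rather than by plain characters, and `missing_glyphs` is a hypothetical helper, not a function from any PDF library.

```python
def missing_glyphs(subset_chars, new_text):
    """Return the characters of new_text that the subset cannot draw,
    in order of first appearance and without repeats."""
    available = set(subset_chars)
    missing = []
    for ch in new_text:
        if ch not in available and ch not in missing:
            missing.append(ch)
    return missing


# Characters assumed to be present in the embedded subset.
subset = "OpenSas"

print(missing_glyphs(subset, "Sans"))    # []
print(missing_glyphs(subset, "Sponge"))  # ['o', 'g']
```

Note that "Sponge" fails on the lowercase "o" even though an uppercase "O" is embedded: upper and lower case are separate glyphs, so having one does not give you the other.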
Complex Non-English Languages
In many scripts, we can generally assume that each character (glyph) fits into its own box. Latin and CJK (Chinese, Japanese, and Korean) characters are generally "discrete": each one occupies a box and does not change shape based on its neighbors. In typography, this box is related to the em square (the design grid a glyph is drawn on), and the horizontal distance from one glyph to the next is its advance width.
The problem arises with complex scripts such as Arabic, Hebrew, and the Indic scripts (like Devanagari), which do not fit this box model. Arabic and Hebrew are written right to left and need bidirectional handling when mixed with left-to-right text. Arabic and the Indic scripts also rely on contextual shaping and ligatures, where the visual appearance of a character changes based on its neighbors. For example, an Arabic letter can take a different shape depending on whether it stands alone or sits at the start, in the middle, or at the end of a word.
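Unicode happens to record the four contextual shapes of each Arabic letter as legacy "presentation form" code points, which makes the idea easy to see from Python's standard `unicodedata` module. The sketch below uses the letter BEH as an example. It only illustrates that one logical letter has several shapes; it is not how text is normally stored, since documents usually keep the base letter and let a shaping engine pick the right glyph.

```python
import unicodedata

BEH = "\u0628"  # the logical letter, as normally stored in text

# Legacy presentation forms: isolated, initial, medial, final.
FORMS = ["\ufe8f", "\ufe91", "\ufe92", "\ufe90"]

for form in FORMS:
    print(f"U+{ord(form):04X}", unicodedata.name(form))

# Compatibility normalization folds every shape back to the same letter.
same_letter = all(unicodedata.normalize("NFKC", form) == BEH for form in FORMS)
print(same_letter)
```

A PDF only records which glyph was drawn, so an editor that changes one letter in the middle of an Arabic word may have to re-shape its neighbors too, because their correct forms depend on what now sits next to them.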
Let's use Devanagari as an example. Consider the word for "Hindi": हिन्दी
- Logical Order (The Boxes): [ह] + [ि] + [न] + [्] + [द] + [ी]
- Rendered Reality: The vowel ि (short 'i') physically moves to the left of the ह (h), and the न (n) merges into the द (d) to form a single combined shape (ligature).
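The logical order listed above can be checked with Python's standard `unicodedata` module. The sketch below spells the word with escape sequences so the stored order is unambiguous, then prints each code point with its Unicode name.

```python
import unicodedata

# The word "Hindi" in logical (stored) order, written as escapes.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

names = [unicodedata.name(ch) for ch in word]
for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X}", name)

print(len(word))  # number of stored code points
```

The text holds six code points, with the vowel sign stored after the consonant it is drawn in front of. Turning that sequence into the shapes you see on screen is the job of a shaping engine, and a PDF stores only the shaped result, which is why mapping an edit back to the right glyphs is so hard for these scripts.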
Why Most PDF Tools Are Paid
Because of the edge cases mentioned above, high-quality PDF manipulation is expensive to build and maintain. Most robust libraries are locked behind steep enterprise paywalls. If you look at the landscape across different programming languages, the "pro" options are almost always commercial:
| Language | Commercial/Enterprise Options | Open Source Alternatives |
|---|---|---|
| C# / .NET | Aspose.PDF, IronPDF | QuestPDF (Layout), iText (AGPL) |
| Python | ReportLab Plus | PyMuPDF, pdfminer.six |
| Node.js | Syncfusion, PSPDFKit | PDF.js (Viewing), pdf-lib (Basic Edit) |
| Java | DynamicPDF, iText Pro | Apache PDFBox, LibreOffice (headless conversion) |
Libraries such as Syncfusion, Aspose, IronPDF, and DynamicPDF often have robust text editing capabilities, but they come at a cost that can be prohibitive for many users and developers. These costs are usually justified by the complexity of implementing text editing in PDFs, which requires handling a wide range of edge cases and ensuring that the edited text integrates seamlessly with the existing content and formatting.
Text = Dates?
I like to think of PDF text as similar to handling dates in programming. Dates seem simple on the surface, just a day, a month, and a year. But when you dig into the details, you realize there are many complexities and edge cases that make them difficult to handle correctly. These include time zones, leap years, daylight saving time, different calendar systems, and more.
PDF text is the same. A word on a page looks simple, but underneath it involves precise coordinate positioning, font embedding, character encoding, and script-shaping rules. Just as a date library that ignores time zones will break on half the planet, a PDF text editor that ignores these rules will produce corrupted or misaligned text in ways that are hard to predict.
I hope you learned something new. This post sums up what I have learned so far from diving deep into how PDF text actually works under the hood. It is one of those topics where the more you dig, the more you appreciate why so few tools get it right, and why the ones that do often come with a price tag. If anything, I hope it gives you a better sense of what is happening when you click on a text block in a PDF editor, and why things sometimes break in ways that seem completely unpredictable.
Tips
- Use a tool that handles the complexity for you: The Edit PDF tool handles all the low-level PDF mechanics automatically. Font embedding and coordinate transforms are managed behind the scenes, so you can edit text without worrying about breaking the document structure.
- Match fonts when editing: Because embedded fonts may only contain the characters from the original document, clicking a nearby text block before typing helps you match the font in use. If a character is missing from the embedded font, the editor will fall back to a supported font like Helvetica.
- Stick to supported fonts: When editing text, it's generally recommended to stick to the fonts that are supported by the PDF editor. Unsupported embedded fonts may not render correctly or could cause issues with the document's layout.
- Cover uneditable text with a rectangle: If a text block was created as vector graphics rather than real text, you cannot edit the characters directly. Place a white rectangle over it, then add a new text block on top. This is the same approach used in professional PDF workflows.
- Password-protected PDF? You will need the document password to open and edit it. The editor will prompt you when you upload the file.