This story on HackerNoon has a decentralized backup on Sia.
Transaction ID: RnjnHUlgmhQFxQgJ7W34WEKJG8rzUS8nArT_fFKT5pA

How to Crop, Split, Remove Pages From PDFs With Java and PDFBox

Written by @alexadam | Published on 2023/6/1

If you don’t like reading long texts on a screen and prefer printing them on paper, this article shows how to use Java to automate some useful PDF manipulations. To save paper and ink or toner, my usual workflow is to remove the extra pages from a PDF, then crop each page to its text area, which removes the white margins and makes the text bigger when printed. Then I merge 2 pages per sheet and split the document into multiple smaller files, which makes it easier to print on both sides when the printer doesn’t support duplex printing. I chose Java and PDFBox because they are fast on large files and offer plenty of options for integrating with other projects.

Create a new Java project with Maven

Let's start by creating a new Java project, named pdf_utils, with Maven:

mvn archetype:generate \
    -DgroupId=com.pdf.pdf_utils \
    -DartifactId=pdf_utils \
    -DarchetypeArtifactId=maven-archetype-quickstart \
    -DarchetypeVersion=1.4 \
    -DinteractiveMode=false

Then, open the pdf_utils/pom.xml file and add a dependency to PDFBox, in the dependencies section:

<dependencies>
    ...
    <dependency>
      <groupId>org.apache.pdfbox</groupId>
      <artifactId>pdfbox</artifactId>
      <version>2.0.27</version>
    </dependency>
    ...
</dependencies>

Also set the compiler's source and target versions, in the plugins section:

<plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>17</source>
        <target>17</target>
      </configuration>
    </plugin>
</plugins>

Then rename the generated src/main/java/com/pdf/pdf_utils/App.java file to PDFUtils.java, and the App class inside it to PDFUtils.

How to crop a PDF

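We will now write a Java function that crops every page of a PDF document and saves the result to a new PDF file. The crop rectangle is passed in millimeters and converted to PDF units (points, 1/72 of an inch, about 0.352778 mm each). The x and y values are the lower-left corner of the rectangle, because the PDF coordinate system normally has its origin at the bottom-left of the page.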

  1. Add helper functions to convert between mm <-> units

  2. Convert mm to units

  3. Extract the doc's file name

  4. Crop each page

  5. Save the result & append "-cropped" to the output file's name

public String cropPDFMM(float x, float y, float width, float height, String srcFilePath) throws IOException {
    // helper functions to convert between mm <-> units (1 unit = 1/72 inch = 0.352778 mm)
    Function<Float, Float> mmToUnits = (Float a) -> a / 0.352778f;
    // not used below; kept for the reverse conversion
    Function<Float, Float> unitsToMm = (Float a) -> a * 0.352778f;

    // convert mm to units
    float xUnits = mmToUnits.apply(x);
    float yUnits = mmToUnits.apply(y);
    float widthUnits = mmToUnits.apply(width);
    float heightUnits = mmToUnits.apply(height);

    // extract the doc's file name
    File srcFile = new File(srcFilePath);
    String fileName = srcFile.getName();
    int dotIndex = fileName.lastIndexOf('.');
    String fileNameWithoutExtension = (dotIndex == -1) ? fileName : fileName.substring(0, dotIndex);

    // the output name is relative, so the file lands in the current working directory
    File outFile = new File(fileNameWithoutExtension + "-cropped.pdf");

    // try-with-resources closes the document even if cropping or saving fails
    try (PDDocument doc = PDDocument.load(srcFile)) {
        // crop each page; (xUnits, yUnits) is the lower-left corner of the new box
        int nrOfPages = doc.getNumberOfPages();
        PDRectangle newBox = new PDRectangle(
                xUnits,
                yUnits,
                widthUnits,
                heightUnits);
        for (int i = 0; i < nrOfPages; i++) {
            doc.getPage(i).setCropBox(newBox);
        }

        // save the result & append -cropped to the file name
        doc.save(outFile);
    }
    return outFile.getCanonicalPath();
}
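
The conversion factor used in cropPDFMM can be checked on its own. The sketch below is not part of the original project: the class name CropUnits and its methods are made up for illustration, and it derives the factor from the definition of a point (1/72 inch, with 25.4 mm per inch) rather than hard-coding 0.352778.

```java
import java.util.Locale;

// Hypothetical helper, not part of the article's PDFUtils class.
class CropUnits {
    // 1 point = 1/72 inch and 1 inch = 25.4 mm, so 1 point is about 0.352778 mm
    static final float MM_PER_POINT = 25.4f / 72f;

    static float mmToPoints(float mm) {
        return mm / MM_PER_POINT;
    }

    static float pointsToMm(float points) {
        return points * MM_PER_POINT;
    }

    public static void main(String[] args) {
        // An A4 page is 210 x 297 mm, which is roughly 595.28 x 841.89 points
        System.out.printf(Locale.ROOT, "A4: %.2f x %.2f pt%n",
                mmToPoints(210f), mmToPoints(297f));
        // The x offset used in the article's test call
        System.out.printf(Locale.ROOT, "18 mm = %.2f pt%n", mmToPoints(18f));
    }
}
```

Plain static methods like these avoid the boxing that Function<Float, Float> introduces, but the lambdas in cropPDFMM give the same numbers.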

Let's test cropPDFMM by calling it from the main function:

public static void main(String[] args) {
    String srcFilePath = "/Users/user/.../file.pdf";
    PDFUtils app = new PDFUtils();

    try {
        ///// crop pdf
        float x = 18f;
        float y = 20f;
        float width = 140f;
        float height = 210f;
        String resultFilePath = app.cropPDFMM(x, y, width, height, srcFilePath);

        System.out.println("Done!");
    } catch (Exception e) {
        System.out.println(e);
    }
}

You should see a file named file-cropped.pdf in the current directory.
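
The output name comes from the source file's name alone, without its directory, which is why the result lands in the current working directory. A minimal sketch of that naming rule, using a hypothetical OutputNames class that is not part of the original code:

```java
import java.io.File;

// Hypothetical helper that mirrors the file-name logic in cropPDFMM.
class OutputNames {
    // Drops the directory and the last extension, then appends suffix + ".pdf"
    static String withSuffix(String srcFilePath, String suffix) {
        String fileName = new File(srcFilePath).getName();
        int dotIndex = fileName.lastIndexOf('.');
        String base = (dotIndex == -1) ? fileName : fileName.substring(0, dotIndex);
        return base + suffix + ".pdf";
    }

    public static void main(String[] args) {
        // prints "file-cropped.pdf"
        System.out.println(withSuffix("/Users/user/pdfs/file.pdf", "-cropped"));
    }
}
```

To write the result next to the source file instead, one option is to build the output with new File(srcFile.getParentFile(), name).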

How to remove pages from a PDF

To remove specific pages from a document, we pass an array of page ranges. Each range is a pair of 1-based page numbers, a start page and an end page, both inclusive ([startPage1, endPage1, startPage2, endPage2, ...]). The function iterates over the pages of the document and checks whether each page number falls inside any of the ranges. If it does not, the page is appended to a new document.

  1. Add a helper function to test if a page is within a range

  2. Test each page number -> if it is outside every range, append the page to a temp. doc

  3. Save the temp. doc. -> overwrite the input file

public void removePages(String srcFilePath, Integer[] pageRanges) throws IOException {
    // ranges come in (start, end) pairs, so the array length must be even
    if (pageRanges.length % 2 != 0) {
        throw new IllegalArgumentException("pageRanges must contain (start, end) pairs");
    }

    // a helper function to test if a page is within a range
    // page is a 0-based index; the ranges are 1-based and inclusive
    BiPredicate<Integer, Integer[]> pageInInterval = (Integer page, Integer[] allPages) -> {
        for (int j = 0; j < allPages.length; j += 2) {
            int startPage = allPages[j];
            int endPage = allPages[j + 1];
            if (page >= startPage - 1 && page < endPage) {
                return true;
            }
        }
        return false;
    };

    File srcFile = new File(srcFilePath);

    // try-with-resources closes both documents even if something fails
    try (PDDocument pdfDocument = PDDocument.load(srcFile);
         PDDocument tmpDoc = new PDDocument()) {
        // test if a page is within a range
        // if not, append the page to a temp. doc.
        for (int i = 0; i < pdfDocument.getNumberOfPages(); i++) {
            if (pageInInterval.test(i, pageRanges)) {
                continue;
            }
            tmpDoc.addPage(pdfDocument.getPage(i));
        }

        // save the temporary doc. while the source doc. is still open
        tmpDoc.save(srcFile);
    }
}
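
The range test is the part most prone to off-by-one mistakes, so it helps to try it without a PDF. The sketch below is a standalone illustration, not the article's code: PageRanges, contains and keptPages are hypothetical names, and it works with 1-based page numbers throughout instead of mixing them with 0-based indexes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone version of the range check used by removePages.
class PageRanges {
    // ranges = [start1, end1, start2, end2, ...], 1-based and inclusive
    static boolean contains(int pageNumber, int[] ranges) {
        if (ranges.length % 2 != 0) {
            throw new IllegalArgumentException("ranges must contain (start, end) pairs");
        }
        for (int j = 0; j < ranges.length; j += 2) {
            if (pageNumber >= ranges[j] && pageNumber <= ranges[j + 1]) {
                return true;
            }
        }
        return false;
    }

    // The 1-based page numbers that survive the removal
    static List<Integer> keptPages(int totalPages, int[] ranges) {
        List<Integer> kept = new ArrayList<>();
        for (int page = 1; page <= totalPages; page++) {
            if (!contains(page, ranges)) {
                kept.add(page);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Removing pages 2-3 from a 5-page document keeps pages 1, 4 and 5
        System.out.println(keptPages(5, new int[] {2, 3})); // prints [1, 4, 5]
    }
}
```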

Let's test it by calling removePages in the main function:

 ///// remove pages
app.removePages(resultFilePath, new Integer[] {1, 21, 376, 428});

It will overwrite the input (cropped) file.

How to split a PDF

We will now introduce a function that enables the splitting of a PDF into multiple separate PDFs, with each resulting file containing a specified number of pages. The function expects two inputs: the path to the source PDF document and the desired number of pages in each split file.

  1. Extract the source file's name

  2. Take the next nrOfPages pages

  3. Append them to a temporary document

  4. Save the temp. doc with the source file's name + the number of its first page

public void splitPDF(String srcFilePath, int nrOfPages) throws IOException {
    if (nrOfPages < 1) {
        throw new IllegalArgumentException("nrOfPages must be at least 1");
    }

    // extract file's name
    File srcFile = new File(srcFilePath);
    String fileName = srcFile.getName();
    int dotIndex = fileName.lastIndexOf('.');
    String fileNameWithoutExtension = (dotIndex == -1) ? fileName : fileName.substring(0, dotIndex);

    try (PDDocument pdfDocument = PDDocument.load(srcFile)) {
        int totalPages = pdfDocument.getNumberOfPages();

        // extract every nrOfPages to a temporary document
        // append an index to its name and save it
        // (Splitter page numbers are 1-based and inclusive; <= keeps the last page)
        for (int i = 1; i <= totalPages; i += nrOfPages) {
            Splitter splitter = new Splitter();

            int fromPage = i;
            int toPage = Math.min(i + nrOfPages - 1, totalPages);
            splitter.setStartPage(fromPage);
            splitter.setEndPage(toPage);
            splitter.setSplitAtPage(nrOfPages);

            List<PDDocument> lst = splitter.split(pdfDocument);

            // the range fits in one chunk, so the first document holds all of it
            File f = new File(fileNameWithoutExtension + "-" + i + ".pdf");
            lst.get(0).save(f);
            for (PDDocument part : lst) {
                part.close();
            }
        }
    }
}
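
The first and last page of each output file can be worked out before touching PDFBox, which makes it easy to check that no page is skipped or written twice. A minimal sketch, assuming 1-based inclusive page numbers as expected by Splitter's setStartPage and setEndPage; ChunkBounds and chunks are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the page ranges that a split into fixed-size files produces.
class ChunkBounds {
    // Each entry is {firstPage, lastPage}, 1-based and inclusive
    static List<int[]> chunks(int totalPages, int pagesPerFile) {
        if (pagesPerFile < 1) {
            throw new IllegalArgumentException("pagesPerFile must be at least 1");
        }
        List<int[]> result = new ArrayList<>();
        for (int from = 1; from <= totalPages; from += pagesPerFile) {
            // the last chunk may be shorter than pagesPerFile
            int to = Math.min(from + pagesPerFile - 1, totalPages);
            result.add(new int[] {from, to});
        }
        return result;
    }

    public static void main(String[] args) {
        // A 41-page document split by 20 gives 1-20, 21-40 and 41-41
        for (int[] chunk : chunks(41, 20)) {
            System.out.println(chunk[0] + "-" + chunk[1]);
        }
    }
}
```

The 41-page case is the one to watch: a loop that stops at i < totalPages never reaches page 41, so the condition has to be i <= totalPages.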

Here is the full main() function:

public static void main( String[] args ){
    String srcFilePath = "/Users/user/pdfs/file.pdf";
    PDFUtils app = new PDFUtils();

    try {
         ///// crop pdf
        float x = 18f;
        float y = 20f;
        float width = 140f;
        float height = 210f;
        String resultFilePath = app.cropPDFMM(x, y, width, height, srcFilePath);

        ///// remove pages
        app.removePages(resultFilePath, new Integer[] {1, 21, 376, 428});
        
        ///// split pages
        app.splitPDF(resultFilePath, 20);

        System.out.println( "Done!" );
    } catch (Exception e) {
        System.out.println(e);
    }
}

Merge 2 pages per sheet (2up)

This function is inspired by https://stackoverflow.com/questions/12093408/pdfbox-merge-2-portrait-pages-onto-a-single-side-by-side-landscape-page.

  1. Create a temporary document

  2. Iterate over the pages of the original doc. -> get the left & right pages

  3. Create a new "output" page with the right dimensions

  4. Append the left page at (0,0)

  5. Append the right page, translated to (left page's width, 0)

  6. Save the temp. doc -> overwrite the source file

public void mergePages(String srcFilePath) throws IOException {

    // SOURCE: https://stackoverflow.com/questions/12093408/pdfbox-merge-2-portrait-pages-onto-a-single-side-by-side-landscape-page
    File srcFile = new File(srcFilePath);

    // try-with-resources closes both documents even if something fails
    try (PDDocument pdfDocument = PDDocument.load(srcFile);
         PDDocument outPdf = new PDDocument()) {
        int nrOfPages = pdfDocument.getNumberOfPages();

        for (int i = 0; i < nrOfPages; i += 2) {
            PDPage page1 = pdfDocument.getPage(i);
            // with an odd page count, the last sheet has no right page
            PDPage page2 = (i + 1 < nrOfPages) ? pdfDocument.getPage(i + 1) : null;
            PDRectangle pdf1Frame = page1.getCropBox();
            // reserve an empty right half of the same size so every sheet keeps the 2-up format
            PDRectangle pdf2Frame = (page2 != null) ? page2.getCropBox() : pdf1Frame;
            PDRectangle outPdfFrame = new PDRectangle(pdf1Frame.getWidth() + pdf2Frame.getWidth(), Math.max(pdf1Frame.getHeight(), pdf2Frame.getHeight()));

            // Create output page with calculated frame and add it to the document
            COSDictionary dict = new COSDictionary();
            dict.setItem(COSName.TYPE, COSName.PAGE);
            dict.setItem(COSName.MEDIA_BOX, outPdfFrame);
            dict.setItem(COSName.CROP_BOX, outPdfFrame);
            dict.setItem(COSName.ART_BOX, outPdfFrame);
            PDPage newP = new PDPage(dict);
            outPdf.addPage(newP);

            // Source PDF pages have to be imported as form XObjects to be able to insert them at a specific point in the output page
            LayerUtility layerUtility = new LayerUtility(outPdf);
            PDFormXObject formPdf1 = layerUtility.importPageAsForm(pdfDocument, page1);
            AffineTransform afLeft = new AffineTransform();
            layerUtility.appendFormAsLayer(newP, formPdf1, afLeft, "left" + i);

            if (page2 != null) {
                PDFormXObject formPdf2 = layerUtility.importPageAsForm(pdfDocument, page2);
                AffineTransform afRight = AffineTransform.getTranslateInstance(pdf1Frame.getWidth(), 0.0);
                layerUtility.appendFormAsLayer(newP, formPdf2, afRight, "right" + i);
            }
        }

        // save while the source doc. is still open -> overwrites the source file
        outPdf.save(srcFile);
    }
}
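
The geometry behind the output sheet is simple enough to verify with plain numbers. The sketch below is an illustration only: TwoUpLayout, Sheet, layout and sheetCount are hypothetical names, and sheetCount assumes that an unpaired last page gets a sheet of its own, which is a choice rather than something PDFBox dictates. It uses a record, so it needs Java 16 or later.

```java
// Hypothetical helper that reproduces the 2-up arithmetic without PDFBox.
class TwoUpLayout {
    // Size of the output sheet and the x offset at which the right page is drawn
    record Sheet(float width, float height, float rightPageOffsetX) {}

    static Sheet layout(float leftWidth, float leftHeight, float rightWidth, float rightHeight) {
        // widths add up, the taller page decides the height,
        // and the right page starts where the left page ends
        return new Sheet(leftWidth + rightWidth, Math.max(leftHeight, rightHeight), leftWidth);
    }

    // Assumption: an odd page count leaves the last page alone on its sheet
    static int sheetCount(int pageCount) {
        return (pageCount + 1) / 2;
    }

    public static void main(String[] args) {
        Sheet sheet = layout(400f, 600f, 380f, 620f);
        System.out.println(sheet);          // width 780, height 620, right page at x = 400
        System.out.println(sheetCount(5));  // prints 3
    }
}
```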

Update main() to test it:

...
 ///// 2 pages per sheet
app.mergePages(resultFilePath);
...

Here is the full list of imports:

package com.pdf.pdf_utils;

import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.multipdf.LayerUtility;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;

import java.awt.geom.AffineTransform;
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Function;

The source code is available here.

Also published here.

Topics and
tags
java|pdfbox|pdf|edit-a-pdf|coding|programming|software-development|software-engineering