Automate Your PDF Tasks: Scripts, APIs and Batch Processing to Save Hours Every Week
One Wednesday morning in January 2024, Marie, an accountant at a Lyon-based SME, opened her computer with a sinking feeling. Before her: 347 PDF invoices to process individually. Extract the data, rename each file to a precise format, merge those from the same client, compress everything, then file it all in the right folders. She knew the task would take her the entire day, as it had at the start of every month for the past three years.
Two weeks later, Marie arrived at the office with a smile. The same pile of 347 invoices awaited her, but this time, she launched a simple Python script. Seven minutes later, everything was done. Extraction, renaming, merging, compression, archiving: everything executed automatically while she had her coffee. Marie had just reclaimed eight hours of her life, every month, for the rest of her career.
There is nothing magical about this transformation. Marie simply discovered PDF task automation. And if her monthly day of drudgery can become seven minutes of automatic execution, what about your own repetitive tasks?
Why Automate Your PDF Tasks: The Hidden Costs of Manual Processing
The True Price of Manual Work
Every professional handles dozens, even hundreds of PDFs every week. Invoices, contracts, reports, quotes, pay slips... These seemingly trivial tasks accumulate a considerable cost that most companies dramatically underestimate.
A 2024 study of 500 European companies puts numbers on the waste: an office employee spends on average 6.5 hours per week on PDF manipulation tasks that could be automated. Over a year, that represents 338 hours, more than eight weeks of productive work lost per person. For a company of 50 employees, the annual cost easily exceeds 500,000 euros in wasted salaries.
Beyond time, manual processing generates costly errors. A mistyped filename, a forgotten page during merging, a document sent to the wrong recipient... These inevitable human errors create cascading complications: payment delays, contractual disputes, regulatory non-compliance. A single error in an invoice can cost thousands of euros in correction time and deteriorated client relationships.
Warning Signs That Call for Automation
Certain situations practically scream for automation. If you recognize yourself in one of these scenarios, you're probably wasting precious time:
The repetitive copy-paste syndrome: You manually extract the same information from dozens of PDFs and re-enter it in a spreadsheet. Each extraction takes two to three minutes, and you do it twenty times a day. This mind-numbing task kills not only your productivity but also your motivation.
Mass renaming hell: You receive files with generic names ("document.pdf", "scan001.pdf") and must rename them according to precise nomenclature. After the tenth file, your brain starts melting, and errors accumulate.
The monthly merger nightmare: Every month-end, you compile dozens of individual reports into a consolidated document. Open, copy, paste, check order, repeat ad nauseam for hours.
Industrial watermarking: You must add a "CONFIDENTIAL" or "DRAFT" watermark to hundreds of documents. Open each PDF, add the watermark, save... Three clicks multiplied by 200 files equals half a day wasted.
Pre-send compression: Your CRM limits attachment size. So you manually compress each PDF before upload, one by one, praying not to degrade quality to the point of making text illegible.
These tasks share three fatal characteristics: they're repetitive, time-consuming and absolutely not creative. Exactly the type of work that computers execute brilliantly while humans can focus on truly value-added activities.
The Spectacular Return on Investment
PDF task automation offers one of the best returns on productivity investment. Unlike many optimization projects that require months of deployment, PDF automation can be operational in a few hours.
Let's take the concrete example of a Parisian notary office. Before automation, three employees collectively spent fifteen hours per week preparing client files: merging documents, adding page numbers, watermarking, compression. A freelance developer created an automated system in two days of work (cost: 1,600 euros). Result: fifteen hours recovered each week, or 780 hours per year. At an average hourly rate of 35 euros, the annual savings reach 27,300 euros. The return on investment? Achieved in three weeks.
Beyond direct financial gains, automation frees up precious intellectual capital. Employees no longer waste their cognitive energy on mind-numbing repetitive tasks. Their job satisfaction improves, turnover decreases, and they can finally focus on stimulating missions that truly exploit their skills.
Python: The Swiss Army Knife of PDF Automation
Why Python Dominates PDF Automation
Python has established itself as the reference language for PDF task automation, and it's no accident. Its clear and intuitive syntax allows even non-developers to create functional scripts after a few hours of learning. An accountant, a lawyer or an administrative assistant can master sufficient basics to automate their own tasks.
The Python ecosystem abounds with libraries specialized in PDF manipulation. PyPDF2, pypdf, ReportLab, pdfplumber, PyMuPDF (fitz)... Each excels in specific domains, offering a complete toolbox for practically all imaginable operations on PDFs.
PyPDF2: The Essential Library
PyPDF2 represents the ideal entry point into PDF automation with Python. This mature, stable library (its development now continues under the name pypdf, with a nearly identical API) performs the most common operations with disarming simplicity.
Installation and first script in 60 seconds:
# Installation via pip
pip install PyPDF2
# First script: merge two PDFs
from PyPDF2 import PdfMerger
merger = PdfMerger()
merger.append('report_january.pdf')
merger.append('report_february.pdf')
merger.write('report_Q1.pdf')
merger.close()
print("Merge completed!")
This handful of lines accomplishes what would take two minutes per merge by hand. Multiply that by fifty monthly merges, and you've just saved one hundred minutes per month.
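If your fifty merges follow a predictable pattern, a short loop can handle an entire folder at once. Here's a minimal sketch, assuming every report for the quarter sits in a single folder (the folder and output names are illustrative):
import os
from PyPDF2 import PdfMerger

source_folder = './monthly_reports'  # illustrative path
merger = PdfMerger()

# sorted() keeps the documents in a predictable alphabetical order;
# adapt the sort key if your filenames encode dates differently
for filename in sorted(os.listdir(source_folder)):
    if filename.endswith('.pdf'):
        merger.append(os.path.join(source_folder, filename))

merger.write('merged_reports.pdf')
merger.close()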
Practical case: Automate specific page extraction
Imagine having to systematically extract pages 3 to 7 from dozens of standardized reports. Here's how to automate this task:
from PyPDF2 import PdfReader, PdfWriter
import os
def extract_specific_pages(source_folder, destination_folder, first_page, last_page):
"""
Extracts specified pages from all PDFs in a folder.
Args:
source_folder: Path to folder containing original PDFs
destination_folder: Path where to save extracts
first_page: Number of first page to extract (starts at 0)
last_page: Number of last page to extract (included)
"""
# Create destination folder if it doesn't exist
os.makedirs(destination_folder, exist_ok=True)
files_processed = 0
# Browse all PDF files in source folder
for filename in os.listdir(source_folder):
if filename.endswith('.pdf'):
full_path = os.path.join(source_folder, filename)
# Open source PDF
reader = PdfReader(full_path)
writer = PdfWriter()
# Extract requested pages
for page_num in range(first_page, min(last_page + 1, len(reader.pages))):
writer.add_page(reader.pages[page_num])
# Save new PDF
output_name = f"extract_{filename}"
output_path = os.path.join(destination_folder, output_name)
with open(output_path, 'wb') as output_file:
writer.write(output_file)
files_processed += 1
print(f"✓ Processed: {filename}")
print(f"\n{files_processed} files processed successfully!")
# Usage
extract_specific_pages(
source_folder='./complete_reports',
destination_folder='./extracted_reports',
first_page=2, # Page 3 (index starts at 0)
last_page=6 # Page 7
)
This script turns a multi-hour task into a few seconds of execution. An HR department that extracts pay slips each month from a 500-page consolidated PDF recovers four hours of work every month.
Automatic rotation based on page orientation
You receive scans with pages in all directions? Automate their straightening:
from PyPDF2 import PdfReader, PdfWriter
def correct_pdf_orientation(input_file, output_file):
"""
Detects and automatically corrects page orientation.
"""
reader = PdfReader(input_file)
writer = PdfWriter()
for page_number, page in enumerate(reader.pages):
# Get page dimensions
width = float(page.mediabox.width)
height = float(page.mediabox.height)
# If width > height, page is probably in landscape
if width > height:
# 90° rotation to return to portrait
page.rotate(90)
print(f"Page {page_number + 1}: rotation applied (landscape → portrait)")
writer.add_page(page)
with open(output_file, 'wb') as file:
writer.write(file)
print(f"Corrected PDF saved: {output_file}")
# Usage
correct_pdf_orientation('mixed_scan.pdf', 'corrected_scan.pdf')
pdfplumber: Intelligent Data Extraction
When it comes to extracting text and structured data (tables, forms), pdfplumber surpasses PyPDF2 with remarkable precision.
Practical case: Automatically extract invoice data
Marie's scenario from the introduction becomes reality with this script:
import pdfplumber
import pandas as pd
import os
import re
def extract_invoice_data(invoices_folder):
"""
Automatically extracts key information from all invoices
and compiles them into an Excel file.
"""
invoice_data = []
for filename in os.listdir(invoices_folder):
if filename.endswith('.pdf'):
invoice_path = os.path.join(invoices_folder, filename)
with pdfplumber.open(invoice_path) as pdf:
# Extract text from first page
page = pdf.pages[0]
text = page.extract_text()
# Extract with regular expressions
invoice_number = re.search(r'No\.\s*:\s*(\d+)', text)
date = re.search(r'Date\s*:\s*(\d{2}/\d{2}/\d{4})', text)
amount = re.search(r'Total\s*:\s*([\d\s,]+)\$', text)
client = re.search(r'Client\s*:\s*(.+)', text)
# Extract tables (invoice lines)
tables = page.extract_tables()
line_count = len(tables[0]) if tables else 0
# Compile data
invoice_data.append({
'File': filename,
'Number': invoice_number.group(1) if invoice_number else 'N/A',
'Date': date.group(1) if date else 'N/A',
'Amount': amount.group(1).replace(' ', '') if amount else 'N/A',
'Client': client.group(1).strip() if client else 'N/A',
'Lines': line_count
})
print(f"✓ Invoice processed: {filename}")
# Create DataFrame and export to Excel
df = pd.DataFrame(invoice_data)
df.to_excel('invoice_extraction.xlsx', index=False)
print(f"\n{len(invoice_data)} invoices analyzed and exported to Excel!")
return df
# Usage
data = extract_invoice_data('./january_invoices')
print(data.head())
This script radically transforms the accounting workflow. What required eight hours of manual entry becomes a few minutes of automated extraction, with better accuracy than a human who is tiring by the hundredth invoice.
ReportLab: Generate Dynamic PDFs
Sometimes automation doesn't consist of manipulating existing PDFs, but creating new ones programmatically. ReportLab excels at this task.
Automatic generation of personalized reports
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import cm
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib import colors
import datetime
def generate_monthly_report(sales_data, filename):
"""
Automatically generates a formatted PDF report from data.
"""
doc = SimpleDocTemplate(filename, pagesize=A4)
styles = getSampleStyleSheet()
elements = []
# Title
title = Paragraph(
f"<b>Monthly Sales Report - {datetime.date.today().strftime('%B %Y')}</b>",
styles['Title']
)
elements.append(title)
elements.append(Spacer(1, 1*cm))
# Executive summary
total_sales = sum([sale['amount'] for sale in sales_data])
summary = Paragraph(
f"Total revenue: <b>${total_sales:,.2f}</b><br/>"
f"Number of transactions: <b>{len(sales_data)}</b><br/>"
f"Average basket: <b>${total_sales/len(sales_data):,.2f}</b>",
styles['Normal']
)
elements.append(summary)
elements.append(Spacer(1, 1*cm))
# Sales table
table_data = [['Date', 'Client', 'Product', 'Amount ($)']]
for sale in sales_data:
table_data.append([
sale['date'],
sale['client'],
sale['product'],
f"{sale['amount']:,.2f}"
])
table = Table(table_data)
table.setStyle(TableStyle([
('BACKGROUND', (0, 0), (-1, 0), colors.grey),
('TEXTCOLOR', (0, 0), (-1, 0), colors.whitesmoke),
('ALIGN', (0, 0), (-1, -1), 'CENTER'),
('FONTNAME', (0, 0), (-1, 0), 'Helvetica-Bold'),
('FONTSIZE', (0, 0), (-1, 0), 12),
('BOTTOMPADDING', (0, 0), (-1, 0), 12),
('BACKGROUND', (0, 1), (-1, -1), colors.beige),
('GRID', (0, 0), (-1, -1), 1, colors.black)
]))
elements.append(table)
# Generate PDF
doc.build(elements)
print(f"Report generated: {filename}")
# Usage example with simulated data
january_sales = [
{'date': '01/01/2025', 'client': 'Company A', 'product': 'Premium Service', 'amount': 5420.00},
{'date': '01/03/2025', 'client': 'Company B', 'product': 'Standard Service', 'amount': 2350.00},
{'date': '01/05/2025', 'client': 'Company C', 'product': 'Premium Service', 'amount': 7890.00},
]
generate_monthly_report(january_sales, 'sales_report_january.pdf')
This approach revolutionizes report creation. Instead of manually creating Word documents then converting them to PDF, you directly generate formatted and personalized reports from your databases. A sales manager who manually generated ten client reports per week now saves six hours weekly.
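To wire this into a real data source, the sales_data list can come straight out of a database rather than being typed by hand. Here's a minimal sketch using Python's built-in sqlite3 module; the sales table, its column names and the ISO date format are assumptions for illustration:
import sqlite3

def load_sales(db_path, month_prefix):
    """Loads one month of sales (month_prefix like '2025-01') as dicts."""
    connection = sqlite3.connect(db_path)
    cursor = connection.execute(
        "SELECT date, client, product, amount FROM sales WHERE date LIKE ?",
        (f"{month_prefix}%",)
    )
    rows = [
        {'date': d, 'client': c, 'product': p, 'amount': a}
        for (d, c, p, a) in cursor.fetchall()
    ]
    connection.close()
    return rows

# january_sales = load_sales('sales.db', '2025-01')
# generate_monthly_report(january_sales, 'sales_report_january.pdf')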
Node.js and pdf-lib: JavaScript-Side Automation
Why Choose Node.js for PDF Automation
Python dominates server-side PDF automation, but JavaScript with Node.js offers decisive advantages in certain contexts. If your infrastructure already relies on Node.js, if you automate web workflows, or if you want to create internal tools accessible via browser, Node.js becomes the natural choice.
The pdf-lib library shines in particular for its ability to create and modify PDFs entirely in JavaScript, on the client or the server, with remarkable performance.
pdf-lib: PDF Manipulation in Modern JavaScript
Installation and configuration
// Installation
npm install pdf-lib
// Import (ES6 modules)
import { PDFDocument, rgb, StandardFonts } from 'pdf-lib';
import fs from 'fs/promises';
Practical case: Automated watermarking system
A frequent use case: automatically add a watermark to all documents in a folder with dynamic information (date, version number, status).
import { PDFDocument, rgb, degrees } from 'pdf-lib';
import fs from 'fs/promises';
import path from 'path';
async function addWatermark(pdfFile, watermarkText, outputFile) {
/**
* Adds a diagonal watermark to all pages of a PDF.
*/
// Load existing PDF
const pdfBytes = await fs.readFile(pdfFile);
const pdfDoc = await PDFDocument.load(pdfBytes);
// Get all pages
const pages = pdfDoc.getPages();
// Loop through each page
for (const page of pages) {
const { width, height } = page.getSize();
// Draw diagonal watermark
page.drawText(watermarkText, {
x: width / 4,
y: height / 2,
size: 80,
color: rgb(0.95, 0.95, 0.95),
rotate: degrees(45),
opacity: 0.3,
});
}
// Save new PDF
const modifiedPdf = await pdfDoc.save();
await fs.writeFile(outputFile, modifiedPdf);
console.log(`✓ Watermark added: ${outputFile}`);
}
async function watermarkFolder(sourceFolder, watermarkText, destinationFolder) {
/**
* Applies watermark to all PDFs in a folder.
*/
// Create destination folder if it doesn't exist
await fs.mkdir(destinationFolder, { recursive: true });
// Read source folder contents
const files = await fs.readdir(sourceFolder);
let counter = 0;
for (const file of files) {
if (path.extname(file).toLowerCase() === '.pdf') {
const sourcePath = path.join(sourceFolder, file);
const destPath = path.join(destinationFolder, `watermark_${file}`);
await addWatermark(sourcePath, watermarkText, destPath);
counter++;
}
}
console.log(`\n${counter} files processed successfully!`);
}
// Usage with dynamic watermark
const today = new Date().toLocaleDateString('en-US');
await watermarkFolder(
'./draft_documents',
`DRAFT - ${today}`,
'./watermarked_documents'
);
Intelligent merging with table of contents
The next level of automation: merge several PDFs while automatically adding a table of contents page.
import { PDFDocument, StandardFonts, rgb } from 'pdf-lib';
import fs from 'fs/promises';
async function mergeWithTOC(fileList, outputFile) {
/**
* Merges several PDFs and generates a table of contents page.
*/
const finalPdf = await PDFDocument.create();
const font = await finalPdf.embedFont(StandardFonts.Helvetica);
const fontBold = await finalPdf.embedFont(StandardFonts.HelveticaBold);
// Create table of contents page
const tocPage = finalPdf.addPage([595, 842]); // A4 format
let yPosition = 750;
tocPage.drawText('TABLE OF CONTENTS', {
x: 50,
y: yPosition,
size: 24,
font: fontBold,
color: rgb(0, 0, 0),
});
yPosition -= 50;
  // Tracker for page numbers
  let currentPageNumber = 2; // starts at 2 (the TOC itself is page 1)
// Merge documents
  for (const file of fileList) {
const pdfBytes = await fs.readFile(file.path);
const sourcePdf = await PDFDocument.load(pdfBytes);
// Copy pages
const copiedPages = await finalPdf.copyPages(sourcePdf, sourcePdf.getPageIndices());
// Draw TOC entry
tocPage.drawText(`${file.title}`, {
x: 70,
y: yPosition,
size: 14,
font: font,
color: rgb(0, 0, 0.8),
});
tocPage.drawText(`${currentPageNumber}`, {
x: 500,
y: yPosition,
size: 14,
font: font,
color: rgb(0, 0, 0),
});
yPosition -= 30;
// Add all pages to final document
for (const page of copiedPages) {
finalPdf.addPage(page);
}
currentPageNumber += copiedPages.length;
}
// Save
const pdfBytes = await finalPdf.save();
await fs.writeFile(outputFile, pdfBytes);
console.log(`✓ PDF merged with TOC: ${outputFile}`);
console.log(` Total pages: ${currentPageNumber - 1}`);
}
// Usage
const documentsToMerge = [
{ path: './executive_report.pdf', title: 'Executive Summary' },
{ path: './financial_analysis.pdf', title: 'Financial Analysis' },
{ path: './forecasts.pdf', title: '2025 Forecasts' },
{ path: './appendices.pdf', title: 'Appendices' },
];
await mergeWithTOC(documentsToMerge, './complete_report_with_toc.pdf');
This automation transforms the creation of composite reports. Instead of merging by hand and building a separate table of contents in Word, everything is generated automatically in seconds.
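One caveat: pdf-lib draws the table of contents as plain text, so the entries are readable but not clickable. If you need genuine bookmarks in the reader's outline panel, pypdf (PyPDF2's successor, mentioned earlier) can attach one per merged document. A minimal Python sketch, reusing the same hypothetical file names:
from pypdf import PdfWriter

documents = [
    ('./executive_report.pdf', 'Executive Summary'),
    ('./financial_analysis.pdf', 'Financial Analysis'),
    ('./forecasts.pdf', '2025 Forecasts'),
    ('./appendices.pdf', 'Appendices'),
]

writer = PdfWriter()
for file_path, title in documents:
    # outline_item creates a clickable bookmark pointing
    # at the first page of each appended document
    writer.append(file_path, outline_item=title)

with open('./merged_with_bookmarks.pdf', 'wb') as f:
    writer.write(f)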
Complete Workflow Automation with Node.js
Node.js excels at orchestrating complex workflows involving multiple steps and services.
Practical case: Automated invoice processing pipeline
import { PDFDocument } from 'pdf-lib';
import fs from 'fs/promises';
import path from 'path';
class InvoicePipeline {
constructor(config) {
this.inputFolder = config.inputFolder;
this.outputFolder = config.outputFolder;
this.archiveFolder = config.archiveFolder;
this.stats = {
processed: 0,
errors: 0,
totalDuration: 0
};
}
async processBatch() {
/**
* Complete pipeline: validation → renaming → compression → archiving
*/
const start = Date.now();
console.log('🚀 Starting processing pipeline...\n');
// Create necessary folders
await this.createFolders();
// Read all files
const files = await fs.readdir(this.inputFolder);
const pdfFiles = files.filter(f => f.endsWith('.pdf'));
console.log(`📄 ${pdfFiles.length} PDF files detected\n`);
// Process each file
for (const file of pdfFiles) {
await this.processFile(file);
}
// Final statistics
const end = Date.now();
this.stats.totalDuration = ((end - start) / 1000).toFixed(2);
this.displayReport();
}
async createFolders() {
await fs.mkdir(this.outputFolder, { recursive: true });
await fs.mkdir(this.archiveFolder, { recursive: true });
}
async processFile(filename) {
try {
const sourcePath = path.join(this.inputFolder, filename);
console.log(`⚙️ Processing: ${filename}`);
// Step 1: Validate PDF
const isValid = await this.validatePDF(sourcePath);
if (!isValid) {
console.log(` ❌ Invalid file, skipped\n`);
this.stats.errors++;
return;
}
// Step 2: Extract metadata and rename
const newName = await this.extractAndRename(sourcePath);
// Step 3: Compress
const compressedPath = await this.compress(
path.join(this.outputFolder, newName)
);
// Step 4: Archive original
await this.archive(sourcePath, filename);
console.log(` ✅ Successfully processed → ${newName}\n`);
this.stats.processed++;
} catch (error) {
console.error(` ❌ Error: ${error.message}\n`);
this.stats.errors++;
}
}
async validatePDF(filePath) {
try {
const bytes = await fs.readFile(filePath);
await PDFDocument.load(bytes);
return true;
} catch {
return false;
}
}
async extractAndRename(sourcePath) {
const bytes = await fs.readFile(sourcePath);
const pdfDoc = await PDFDocument.load(bytes);
// Extract metadata (title, creation date)
const title = pdfDoc.getTitle() || 'document';
const date = new Date().toISOString().split('T')[0];
const pageCount = pdfDoc.getPageCount();
// Build new name: YYYYMMDD_title_XXpages.pdf
const newName = `${date}_${this.normalizeName(title)}_${pageCount}p.pdf`;
// Copy to output folder
await fs.copyFile(sourcePath, path.join(this.outputFolder, newName));
return newName;
}
normalizeName(text) {
return text
.toLowerCase()
.replace(/[^a-z0-9]/g, '_')
.replace(/_+/g, '_')
.substring(0, 50);
}
async compress(filePath) {
// Compression simulation (in real life, use Ghostscript or API)
// Here, we simply return the path
console.log(` 🗜️ Compression simulated`);
return filePath;
}
async archive(sourcePath, originalName) {
const archivePath = path.join(this.archiveFolder, originalName);
await fs.rename(sourcePath, archivePath);
}
displayReport() {
console.log('═'.repeat(50));
console.log('📊 FINAL REPORT');
console.log('═'.repeat(50));
console.log(`✅ Files successfully processed: ${this.stats.processed}`);
console.log(`❌ Errors encountered: ${this.stats.errors}`);
console.log(`⏱️ Total duration: ${this.stats.totalDuration}s`);
console.log('═'.repeat(50));
}
}
// Usage
const pipeline = new InvoicePipeline({
inputFolder: './raw_invoices',
outputFolder: './processed_invoices',
archiveFolder: './archived_invoices'
});
await pipeline.processBatch();
This complete pipeline automates a workflow that would take hours manually. Each night, the system can process hundreds of invoices: validation, intelligent renaming, compression, archiving. The next morning, everything is ready, organized, and traceable.
PDF APIs: Cloud Power for Automation
When to Use an API Rather Than a Local Library
Python and Node.js libraries are excellent for local automation, but certain scenarios require the power and scalability of cloud APIs:
Very large-scale processing: When you need to process thousands of PDFs simultaneously, cloud APIs offer massive parallelization that's impractical on a single machine.
Advanced features: High-quality OCR, complex format conversions, intelligent structured data extraction... Specialized APIs often surpass local solutions.
Integration in web applications: For SaaS tools or online user interfaces, APIs naturally integrate into your architecture.
No infrastructure maintenance: APIs eliminate the need to manage servers, library updates, or compatibility issues.
Main Market APIs
Adobe PDF Services API dominates the market with exhaustive functional coverage. Creation, conversion, compression, OCR, data extraction, watermarking... Adobe offers an API for practically every imaginable PDF operation.
// Example: Compression via Adobe PDF Services API (v4 Node.js SDK)
import {
  ServicePrincipalCredentials,
  PDFServices,
  MimeType,
  CompressPDFJob,
  CompressPDFResult
} from '@adobe/pdfservices-node-sdk';
import fs from 'fs';

const credentials = new ServicePrincipalCredentials({
  clientId: process.env.PDF_SERVICES_CLIENT_ID,
  clientSecret: process.env.PDF_SERVICES_CLIENT_SECRET
});
const pdfServices = new PDFServices({ credentials });

// Upload the source file
const inputAsset = await pdfServices.upload({
  readStream: fs.createReadStream('./large_file.pdf'),
  mimeType: MimeType.PDF
});

// Submit the compression job and poll for the result
const job = new CompressPDFJob({ inputAsset });
const pollingURL = await pdfServices.submit({ job });
const response = await pdfServices.getJobResult({ pollingURL, resultType: CompressPDFResult });

// Stream the compressed file to disk
const resultAsset = response.result.asset;
const streamAsset = await pdfServices.getContent({ asset: resultAsset });
streamAsset.readStream.pipe(fs.createWriteStream('./compressed_file.pdf'));
console.log('Compression completed via Adobe API');
PDF.co offers a more affordable alternative with a simple RESTful API and excellent documentation; it's particularly appreciated for conversions and data extraction.
CloudConvert excels at converting multiple formats, PDF included. Its generalist API handles hundreds of different formats.
AWS Textract specializes in intelligent data extraction from scanned PDFs, with exceptional recognition of tables and forms.
Practical case: Automated invoice data extraction with AWS Textract
import AWS from 'aws-sdk';
import fs from 'fs/promises';
class AWSInvoiceExtractor {
constructor() {
this.textract = new AWS.Textract({
region: process.env.AWS_REGION,
accessKeyId: process.env.AWS_ACCESS_KEY_ID,
secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
});
}
async extractInvoiceData(pdfPath) {
/**
* Uses AWS Textract to intelligently extract data
* from a PDF invoice, even scanned.
*/
// Load PDF as bytes
const pdfBytes = await fs.readFile(pdfPath);
    // Send to Textract for analysis.
    // Note: the synchronous analyzeDocument call targets single-page
    // documents; multi-page PDFs require the asynchronous
    // startDocumentAnalysis API with files stored in S3.
const params = {
Document: { Bytes: pdfBytes },
FeatureTypes: ['TABLES', 'FORMS']
};
console.log(`🔍 Analyzing ${pdfPath} with AWS Textract...`);
const result = await this.textract.analyzeDocument(params).promise();
// Parse results
const data = this.parseTextractResults(result);
console.log(`✅ Extraction completed:\n`, JSON.stringify(data, null, 2));
return data;
}
parseTextractResults(result) {
const data = {
key_value_pairs: {},
tables: [],
raw_text: ''
};
// Extract key-value pairs (forms)
const blockMap = {};
result.Blocks.forEach(block => {
blockMap[block.Id] = block;
});
result.Blocks.forEach(block => {
if (block.BlockType === 'KEY_VALUE_SET' && block.EntityTypes?.includes('KEY')) {
const keyText = this.extractText(block, blockMap);
const valueBlock = block.Relationships?.find(r => r.Type === 'VALUE');
if (valueBlock) {
const valueId = valueBlock.Ids[0];
const valueText = this.extractText(blockMap[valueId], blockMap);
data.key_value_pairs[keyText] = valueText;
}
}
});
// Extract tables
result.Blocks.forEach(block => {
if (block.BlockType === 'TABLE') {
const table = this.extractTable(block, blockMap);
data.tables.push(table);
}
});
return data;
}
extractText(block, blockMap) {
if (block.Text) return block.Text;
let text = '';
if (block.Relationships) {
block.Relationships.forEach(relation => {
if (relation.Type === 'CHILD') {
relation.Ids.forEach(id => {
const childBlock = blockMap[id];
if (childBlock.Text) {
text += childBlock.Text + ' ';
}
});
}
});
}
return text.trim();
}
extractTable(block, blockMap) {
const cells = {};
block.Relationships?.forEach(relation => {
if (relation.Type === 'CHILD') {
relation.Ids.forEach(id => {
const cell = blockMap[id];
if (cell.BlockType === 'CELL') {
const row = cell.RowIndex;
const col = cell.ColumnIndex;
const text = this.extractText(cell, blockMap);
if (!cells[row]) cells[row] = {};
cells[row][col] = text;
}
});
}
});
// Convert to 2D array
return Object.values(cells).map(row => Object.values(row));
}
}
// Usage
const extractor = new AWSInvoiceExtractor();
const invoices = [
'./invoice_01.pdf',
'./invoice_02.pdf',
'./invoice_03.pdf'
];
for (const invoice of invoices) {
const data = await extractor.extractInvoiceData(invoice);
// Save as JSON
const outputName = invoice.replace('.pdf', '_data.json');
await fs.writeFile(outputName, JSON.stringify(data, null, 2));
}
This approach revolutionizes scanned invoice processing. AWS Textract intelligently recognizes the document's structure, extracts key fields (invoice number, date, total amount), and parses invoice line tables with accuracy frequently above 95%.
Batch Processing: Automate at Large Scale
Optimization Strategies for Massive Processing
When you process dozens, hundreds or thousands of PDFs, optimization strategies become critical. A naive script that processes files sequentially will take hours, while an optimized approach accomplishes the same task in minutes.
Parallelization with Python multiprocessing
from multiprocessing import Pool, cpu_count
import PyPDF2
import os
import time
def process_one_pdf(file_info):
"""
Function that processes a single PDF (will be executed in parallel).
"""
source_path, destination_folder = file_info
filename = os.path.basename(source_path)
try:
# Open and process PDF
reader = PyPDF2.PdfReader(source_path)
writer = PyPDF2.PdfWriter()
# Example: Extract odd pages
for i in range(0, len(reader.pages), 2):
writer.add_page(reader.pages[i])
# Save
output_path = os.path.join(destination_folder, f"processed_{filename}")
with open(output_path, 'wb') as output_file:
writer.write(output_file)
return f"✓ {filename}"
except Exception as e:
return f"✗ {filename}: {str(e)}"
def massive_parallel_processing(source_folder, destination_folder, num_processes=None):
"""
Processes all PDFs in a folder in parallel to maximize speed.
"""
# Create destination folder
os.makedirs(destination_folder, exist_ok=True)
# List all PDFs
pdf_files = [
(os.path.join(source_folder, f), destination_folder)
for f in os.listdir(source_folder)
if f.endswith('.pdf')
]
print(f"🚀 Processing {len(pdf_files)} files...")
print(f"⚙️ Using {num_processes or cpu_count()} parallel processes\n")
start = time.time()
# Create process pool
with Pool(processes=num_processes) as pool:
        # Process in parallel (map blocks until every file is done)
results = pool.map(process_one_pdf, pdf_files)
duration = time.time() - start
# Display results
print("\n" + "="*60)
for result in results:
print(result)
print("="*60)
print(f"⏱️ Total duration: {duration:.2f} seconds")
print(f"📊 Speed: {len(pdf_files)/duration:.1f} files/second")
# Usage
massive_parallel_processing(
source_folder='./pdfs_to_process',
destination_folder='./processed_pdfs',
num_processes=8 # Use 8 CPU cores
)
On a modern 8-core computer, this parallelized approach runs 5 to 10 times faster than sequential processing: 1,000 files that would take 30 minutes sequentially are processed in about 5 minutes.
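Multiprocessing shines for CPU-bound work, but when the bottleneck is network I/O (calls to a cloud PDF API, for instance) threads are often the lighter option, since Python releases the GIL while waiting on I/O. Here's a minimal sketch with the standard concurrent.futures module; upload_to_api is a hypothetical stand-in for a real API call:
from concurrent.futures import ThreadPoolExecutor, as_completed
import os

def upload_to_api(pdf_path):
    # Placeholder for an I/O-bound call (compression, OCR...);
    # replace the body with a real HTTP request in production
    with open(pdf_path, 'rb') as f:
        data = f.read()
    return f"✓ {os.path.basename(pdf_path)}: {len(data)} bytes sent"

pdf_files = [
    os.path.join('./pdfs_to_process', f)
    for f in os.listdir('./pdfs_to_process')
    if f.endswith('.pdf')
]

with ThreadPoolExecutor(max_workers=16) as executor:
    futures = [executor.submit(upload_to_api, p) for p in pdf_files]
    for future in as_completed(futures):
        print(future.result())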
Batch processing with robust error handling
import PyPDF2
import logging
import os
import json
from datetime import datetime
class RobustBatchProcessing:
"""
Batch processing system with logging, error recovery,
and detailed reports.
"""
    def __init__(self, source_folder, destination_folder):
        self.source_folder = source_folder
        self.destination_folder = destination_folder
        # Make sure the output folder exists before the first write
        os.makedirs(destination_folder, exist_ok=True)
self.stats = {
'total': 0,
'successful': 0,
'errors': 0,
'error_details': []
}
# Configure logging
self.logger = logging.getLogger('PDFProcessing')
self.logger.setLevel(logging.INFO)
# File handler
fh = logging.FileHandler(f'processing_{datetime.now().strftime("%Y%m%d_%H%M%S")}.log')
fh.setLevel(logging.INFO)
# Console handler
ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
fh.setFormatter(formatter)
ch.setFormatter(formatter)
self.logger.addHandler(fh)
self.logger.addHandler(ch)
def process_batch(self, batch_size=50):
"""
Processes files in batches to optimize memory.
"""
self.logger.info("🚀 Starting batch processing")
# List all files
all_files = [
f for f in os.listdir(self.source_folder)
if f.endswith('.pdf')
]
self.stats['total'] = len(all_files)
self.logger.info(f"📄 {self.stats['total']} files detected")
# Process in batches
for i in range(0, len(all_files), batch_size):
batch = all_files[i:i+batch_size]
batch_number = i // batch_size + 1
self.logger.info(f"\n📦 Processing batch {batch_number} ({len(batch)} files)")
for file in batch:
self.process_file_with_error_handling(file)
self.generate_report()
def process_file_with_error_handling(self, filename):
"""
Processes a file with complete error handling.
"""
source_path = os.path.join(self.source_folder, filename)
try:
# Your processing logic here
reader = PyPDF2.PdfReader(source_path)
# Check integrity
if reader.is_encrypted:
raise ValueError("Encrypted PDF")
if len(reader.pages) == 0:
raise ValueError("Empty PDF")
# Processing (example: rotation)
writer = PyPDF2.PdfWriter()
for page in reader.pages:
page.rotate(90)
writer.add_page(page)
# Save
output_path = os.path.join(self.destination_folder, filename)
with open(output_path, 'wb') as f:
writer.write(f)
self.stats['successful'] += 1
self.logger.info(f" ✅ {filename}")
except Exception as e:
self.stats['errors'] += 1
error_detail = {
'file': filename,
'error': str(e),
'type': type(e).__name__
}
self.stats['error_details'].append(error_detail)
self.logger.error(f" ❌ {filename}: {str(e)}")
def generate_report(self):
"""
Generates detailed processing report.
"""
self.logger.info("\n" + "="*70)
self.logger.info("📊 FINAL REPORT")
self.logger.info("="*70)
self.logger.info(f"Total files: {self.stats['total']}")
self.logger.info(f"✅ Successful: {self.stats['successful']}")
self.logger.info(f"❌ Errors: {self.stats['errors']}")
success_rate = (self.stats['successful'] / self.stats['total'] * 100) if self.stats['total'] > 0 else 0
self.logger.info(f"📈 Success rate: {success_rate:.1f}%")
# Save JSON report
json_report = {
'timestamp': datetime.now().isoformat(),
'stats': self.stats
}
with open(f'report_{datetime.now().strftime("%Y%m%d_%H%M%S")}.json', 'w') as f:
json.dump(json_report, f, indent=2, ensure_ascii=False)
self.logger.info("="*70)
# Usage
processing = RobustBatchProcessing(
source_folder='./source_pdfs',
destination_folder='./processed_pdfs'
)
processing.process_batch(batch_size=100)
This robust system gracefully handles errors, logs all events, generates detailed reports, and allows tracing exactly what happened during massive processing. Essential in professional environments.
Concrete Use Cases: From Theory to Practice
Accounting Automation: Marie's System
Let's return to Marie, our accountant from the beginning. Here's the complete system she set up:
import PyPDF2
import pdfplumber
import pandas as pd
import os
import re
from datetime import datetime
import shutil
class AccountingAutomationSystem:
"""
Complete invoice automation system for accounting department.
"""
def __init__(self, config):
self.raw_invoices_folder = config['raw_folder']
self.processed_folder = config['processed_folder']
self.archive_folder = config['archive_folder']
self.extraction_file = config['extraction_file']
self.extracted_data = []
self.stats = {'processed': 0, 'errors': 0}
def execute_complete_pipeline(self):
"""
Complete pipeline: extraction → renaming → merge by client → archiving
"""
print("🚀 Starting automated accounting pipeline\n")
start = datetime.now()
# Step 1: Extract data and rename
print("📊 Step 1: Data extraction and renaming...")
self.extract_and_rename_all_invoices()
# Step 2: Merge by client
print("\n📦 Step 2: Merging invoices by client...")
self.merge_by_client()
# Step 3: Compress
print("\n🗜️ Step 3: Compressing merged PDFs...")
self.compress_merged_pdfs()
# Step 4: Archive originals
print("\n📁 Step 4: Archiving original invoices...")
self.archive_originals()
# Step 5: Export data
print("\n💾 Step 5: Exporting extracted data...")
self.export_data_excel()
# Final report
duration = (datetime.now() - start).total_seconds()
print("\n" + "="*60)
print("✅ PIPELINE COMPLETED SUCCESSFULLY")
print("="*60)
print(f"⏱️ Total duration: {duration:.1f} seconds")
print(f"📄 Invoices processed: {self.stats['processed']}")
print(f"❌ Errors: {self.stats['errors']}")
print("="*60)
def extract_and_rename_all_invoices(self):
"""
Extracts data from all invoices and renames them intelligently.
"""
os.makedirs(self.processed_folder, exist_ok=True)
for file in os.listdir(self.raw_invoices_folder):
if file.endswith('.pdf'):
source_path = os.path.join(self.raw_invoices_folder, file)
try:
data = self.extract_invoice_data(source_path)
if data:
# Rename: YYYYMMDD_ClientName_InvoiceNum.pdf
new_name = f"{data['date_iso']}_{data['client_norm']}_{data['number']}.pdf"
dest_path = os.path.join(self.processed_folder, new_name)
shutil.copy(source_path, dest_path)
self.extracted_data.append(data)
self.stats['processed'] += 1
print(f" ✓ {file} → {new_name}")
except Exception as e:
self.stats['errors'] += 1
print(f" ✗ {file}: {str(e)}")
def extract_invoice_data(self, pdf_path):
"""
Extracts key information from an invoice.
"""
with pdfplumber.open(pdf_path) as pdf:
page = pdf.pages[0]
text = page.extract_text()
# Regular expressions for extraction
            number = re.search(r'(?:No\.|Invoice)\s*:?\s*([A-Z0-9-]+)', text, re.IGNORECASE)
date = re.search(r'Date\s*:?\s*(\d{2}[/-]\d{2}[/-]\d{4})', text)
amount = re.search(r'(?:Total|Amount)\s*:?\s*([\d\s,\.]+)\s*\$', text)
client = re.search(r'(?:Client|Customer)\s*:?\s*(.+)', text)
if number and date and amount:
# Normalize date
raw_date = date.group(1)
date_obj = datetime.strptime(raw_date.replace('/', '-'), '%m-%d-%Y')
return {
'original_file': os.path.basename(pdf_path),
'number': number.group(1),
'date': raw_date,
'date_iso': date_obj.strftime('%Y%m%d'),
                    'amount': amount.group(1).replace(' ', '').replace(',', ''),  # "1,234.56" -> "1234.56"
'client': client.group(1).strip() if client else 'UNKNOWN',
'client_norm': self.normalize_client_name(client.group(1).strip() if client else 'UNKNOWN')
}
return None
def normalize_client_name(self, name):
"""
Normalizes client name to create valid filenames.
"""
return re.sub(r'[^a-zA-Z0-9]', '_', name)[:30].upper()
def merge_by_client(self):
"""
Merges all invoices from each client into a single PDF.
"""
# Group by client
invoices_by_client = {}
for file in os.listdir(self.processed_folder):
if file.endswith('.pdf'):
                # Filename format: YYYYMMDD_CLIENT_NAME_NUMBER.pdf.
                # The normalized client name may itself contain underscores,
                # so keep everything between the date and the invoice number.
                parts = file[:-4].split('_')
                if len(parts) >= 3:
                    client = '_'.join(parts[1:-1])
                    invoices_by_client.setdefault(client, []).append(file)
# Merge each group
merge_folder = os.path.join(self.processed_folder, 'client_merges')
os.makedirs(merge_folder, exist_ok=True)
for client, invoices in invoices_by_client.items():
if len(invoices) > 1:
merger = PyPDF2.PdfMerger()
# Sort by date (in filename)
invoices.sort()
for invoice in invoices:
path = os.path.join(self.processed_folder, invoice)
merger.append(path)
# Save merge
merge_name = f"MERGE_{client}_{datetime.now().strftime('%Y%m')}.pdf"
merge_path = os.path.join(merge_folder, merge_name)
merger.write(merge_path)
merger.close()
print(f" ✓ {client}: {len(invoices)} invoices merged → {merge_name}")
def compress_merged_pdfs(self):
"""
Compresses merged PDFs to save space.
"""
merge_folder = os.path.join(self.processed_folder, 'client_merges')
if os.path.exists(merge_folder):
print(" (Compression simulated - in production, use Ghostscript)")
def archive_originals(self):
"""
Archives original invoices by month.
"""
current_month = datetime.now().strftime('%Y-%m')
month_archive_folder = os.path.join(self.archive_folder, current_month)
os.makedirs(month_archive_folder, exist_ok=True)
for file in os.listdir(self.raw_invoices_folder):
if file.endswith('.pdf'):
source = os.path.join(self.raw_invoices_folder, file)
destination = os.path.join(month_archive_folder, file)
shutil.move(source, destination)
print(f" ✓ Invoices archived in {month_archive_folder}")
def export_data_excel(self):
"""
Exports all extracted data to Excel.
"""
if self.extracted_data:
df = pd.DataFrame(self.extracted_data)
df.to_excel(self.extraction_file, index=False)
print(f" ✓ Data exported: {self.extraction_file}")
# Configuration and execution
config = {
'raw_folder': './january_raw_invoices',
'processed_folder': './processed_invoices',
'archive_folder': './invoice_archives',
'extraction_file': f'invoice_extraction_{datetime.now().strftime("%Y%m%d")}.xlsx'
}
system = AccountingAutomationSystem(config)
system.execute_complete_pipeline()
This complete system transformed Marie's professional life. Eight hours of manual work per month become seven minutes of automatic execution. More importantly, automated data extraction eliminates entry errors and allows immediate analysis in Excel.
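One loose end: the compression steps in the pipelines above are only simulated. In production, a common approach is to shell out to Ghostscript, which rewrites the PDF with downsampled images. Here's a minimal sketch, assuming the gs binary is installed and on the PATH:
import subprocess

def compress_pdf(input_path, output_path, quality='/ebook'):
    """
    Rewrites a PDF through Ghostscript to reduce its size.
    quality: /screen (smallest), /ebook (good compromise), /prepress (highest quality)
    """
    subprocess.run([
        'gs',
        '-sDEVICE=pdfwrite',
        '-dCompatibilityLevel=1.4',
        f'-dPDFSETTINGS={quality}',
        '-dNOPAUSE', '-dQUIET', '-dBATCH',
        f'-sOutputFile={output_path}',
        input_path
    ], check=True)

# compress_pdf('merged.pdf', 'merged_compressed.pdf')
The /ebook preset is usually a good compromise between file size and readability; /screen compresses harder at the cost of image quality.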
Deployment and Scheduling: Making Automation Truly Automatic
Complete Automation with Task Schedulers
The final level of automation is not having to launch your scripts manually at all. Task schedulers turn them into truly automatic processes.
On Linux/Mac with cron
# Edit crontab
crontab -e
# Execute invoice processing script every Monday at 8:00 AM
0 8 * * 1 /usr/bin/python3 /home/marie/scripts/invoice_processing.py
# Execute monthly report on 1st of each month at 9:00 AM
0 9 1 * * /usr/bin/python3 /home/marie/scripts/monthly_report.py
# Variant of the first job, with output appended to a log file
0 8 * * 1 /usr/bin/python3 /home/marie/scripts/invoice_processing.py >> /var/log/invoices.log 2>&1
On Windows with Task Scheduler
# Python script to create Windows scheduled task
import win32com.client
from datetime import datetime
def create_windows_scheduled_task(task_name, script_path, execution_time):
"""
Creates Windows scheduled task to execute Python script.
"""
scheduler = win32com.client.Dispatch('Schedule.Service')
scheduler.Connect()
root_folder = scheduler.GetFolder('\\')
task_def = scheduler.NewTask(0)
# Define trigger (daily at specified time)
trigger = task_def.Triggers.Create(2) # 2 = daily
trigger.StartBoundary = datetime.now().replace(
hour=execution_time.hour,
minute=execution_time.minute,
second=0
).isoformat()
# Define action (execute Python)
action = task_def.Actions.Create(0) # 0 = execute
action.Path = 'C:\\Python39\\python.exe'
action.Arguments = script_path
# Settings
task_def.RegistrationInfo.Description = f'PDF Automation - {task_name}'
task_def.Settings.Enabled = True
task_def.Settings.StopIfGoingOnBatteries = False
# Register task
root_folder.RegisterTaskDefinition(
task_name,
task_def,
6, # TASK_CREATE_OR_UPDATE
None,
None,
3 # TASK_LOGON_INTERACTIVE_TOKEN
)
print(f"✅ Scheduled task created: {task_name}")
# Usage
create_windows_scheduled_task(
task_name='AutomaticInvoiceProcessing',
script_path='C:\\Scripts\\invoice_processing.py',
execution_time=datetime.now().replace(hour=8, minute=0)
)
Monitoring and Alerts
A professional automation system includes notifications to inform users of success or errors.
import smtplib
import os
from datetime import datetime
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email import encoders
class EmailNotifier:
"""
Sends email notifications about automation status.
"""
def __init__(self, smtp_server, smtp_port, sender_email, password):
self.smtp_server = smtp_server
self.smtp_port = smtp_port
self.sender_email = sender_email
self.password = password
def send_processing_report(self, recipient, stats, log_file=None):
"""
Sends report email after automatic processing.
"""
message = MIMEMultipart()
message['From'] = self.sender_email
message['To'] = recipient
message['Subject'] = f"✅ Automatic PDF processing completed - {datetime.now().strftime('%m/%d/%Y')}"
# Message body
body = f"""
<html>
<body>
<h2>Automatic PDF Processing Report</h2>
<p>Automatic invoice processing completed successfully.</p>
<h3>Statistics:</h3>
<ul>
<li><b>Total files:</b> {stats['total']}</li>
<li><b>✅ Successful:</b> {stats['successful']}</li>
<li><b>❌ Errors:</b> {stats['errors']}</li>
<li><b>⏱️ Duration:</b> {stats['duration']}</li>
</ul>
<p>Processed files are available in the shared folder.</p>
<p style="color: #666; font-size: 12px;">
Automatic message generated by PDF automation system
</p>
</body>
</html>
"""
message.attach(MIMEText(body, 'html'))
# Attach log file if provided
if log_file and os.path.exists(log_file):
with open(log_file, 'rb') as f:
part = MIMEBase('application', 'octet-stream')
part.set_payload(f.read())
encoders.encode_base64(part)
part.add_header('Content-Disposition', f'attachment; filename={os.path.basename(log_file)}')
message.attach(part)
# Send
try:
with smtplib.SMTP(self.smtp_server, self.smtp_port) as server:
server.starttls()
server.login(self.sender_email, self.password)
server.send_message(message)
print(f"📧 Report email sent to {recipient}")
except Exception as e:
print(f"❌ Email sending error: {str(e)}")
# Usage in your main script
notifier = EmailNotifier(
smtp_server='smtp.gmail.com',
smtp_port=587,
sender_email='automation@company.com',
password='your_app_password'
)
# After processing
stats = {
'total': 347,
'successful': 342,
'errors': 5,
'duration': '7 minutes 23 seconds'
}
notifier.send_processing_report(
recipient='marie@company.com',
stats=stats,
log_file='processing_20250110.log'
)
Conclusion: Automation as Strategic Investment
PDF task automation represents much more than simple technical optimization. It's a profound transformation of your relationship with work, a recovery of precious time, and an elimination of repetitive tasks that erode motivation and create errors.
The numbers speak for themselves. A professional who automates their repetitive PDF tasks recovers on average 5 to 10 hours per week. Over a year, that represents 260 to 520 hours, or 6 to 13 weeks of productive work. The return on investment of an automation project is generally measured in days or weeks, not months.
Start small. Identify a single repetitive task you perform regularly. Dedicate a few hours to creating a simple script to automate it. Measure the time savings. Then move to the next task. Each automation adds up, progressively creating a system that radically transforms your productivity.
PDF automation isn't reserved for expert developers. Modern tools, well-documented libraries, and accessible APIs democratize this technology. An accountant, a lawyer, an administrative assistant can master the basics in a few days and create automations that change their professional life.
The real investment isn't financial but temporal. A few hours of initial learning and development generate hundreds of recovered hours afterward. It's probably one of the best investments you can make in your productivity and professional well-being.
The future of work doesn't belong to humans who do machines' work, but to humans who know how to orchestrate machines to multiply their impact. PDF automation is your gateway to this future. Cross it today.
To start immediately, use our PDF merge tool or our PDF compressor online. These free tools allow you to manipulate your PDFs manually while waiting to develop your own automations. Each manual task you perform is an opportunity to identify your next automation project.
FAQ: Your Questions About PDF Automation
Do I need to be a developer to automate my PDF tasks?
No, absolutely not. Modern libraries like PyPDF2 and pdf-lib are designed to be accessible to beginners. With a few hours of basic Python training (available free on platforms like Codecademy or FreeCodeCamp), you can create your first automation scripts. Many non-technical professionals (accountants, lawyers, administrative assistants) successfully automate their PDF tasks after initial training of 10 to 15 hours.
What's the difference between a local library and a cloud API?
Local libraries (PyPDF2, pdf-lib) execute directly on your computer, without internet connection required. Your files never leave your machine, ensuring maximum confidentiality. Cloud APIs require sending your PDFs to remote servers for processing. They generally offer more advanced features (high-quality OCR, complex conversions) and superior scalability, but involve confidentiality considerations and usage costs. For most automation needs, local libraries are entirely sufficient.
How much does setting up a PDF automation system cost?
The cost can be almost zero. Python and all mentioned libraries (PyPDF2, pdfplumber, ReportLab) are free and open source. Node.js and pdf-lib as well. The only real investment is your learning and development time. If you prefer to outsource, a freelance developer can create a custom automation system for 500 to 2000 euros depending on complexity, with ROI generally achieved in a few weeks.
My PDFs contain sensitive data. Is automation secure?
Using local libraries (PyPDF2, pdf-lib), your files never leave your computer. This is perfectly secure, equivalent to manual manipulation. If you use cloud APIs, always check the provider's privacy policy. Serious services (Adobe, AWS) offer confidentiality guarantees and immediate file deletion after processing, with security certifications (SOC 2, ISO 27001). For ultra-sensitive data (medical, legal), systematically favor local solutions.
Can we automate data extraction from scanned PDFs?
Yes, thanks to OCR (Optical Character Recognition). For simple needs, use the Python pytesseract library (free). For professional accuracy, favor AWS Textract or Adobe PDF Services API which offer exceptional recognition, including complex tables and handwriting. These cloud APIs generally charge per page (a few cents), making cost negligible for most uses.
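For the pytesseract route, here's a minimal sketch; it assumes Tesseract and Poppler are installed on the system, with pdf2image rasterizing each page before recognition:
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(pdf_path):
    """Returns the recognized text of every page of a scanned PDF."""
    pages = convert_from_path(pdf_path, dpi=300)  # one image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

# print(ocr_pdf('scanned_invoice.pdf'))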
How long does it take to develop a first automation?
For a simple task (merge PDFs, extract specific pages), count 30 minutes to 2 hours for a beginner, including learning basics. For a more complex system like Marie's complete accounting pipeline, plan 1 to 3 days for a beginner, a few hours for someone with Python basics. The learning curve is fast: your third automation will be 5 times faster to develop than the first.
Can automation handle very large volumes?
Absolutely. With parallelization techniques (Python multiprocessing, JavaScript async/await), you can process thousands of PDFs simultaneously. A modern computer with 8 cores can process 500 to 1000 simple PDFs per minute. For even larger volumes (tens of thousands), cloud APIs offer virtually unlimited scalability. Batch processing with robust error handling ensures reliability even on massive corpora.