Text from .pdf file(s), again...
Hello friends,
I see Kim didn't get around to making app.readFileText() read .pdf files (or at least didn't mention it in the version log/what's new), so I'm posting my PDF reader script here for anyone interested. It is written to pull the statement date out of a bank statement, like Laurent asked about in some recent threads, but it could be adapted to other jobs. You need a PDF editor that can OCR the files, preferably in a batch if you have a lot to do. I use PDF Studio and it works like a charm. You also need a key phrase that locates the information you want, so look through your OCRed files to find one. In my statements that is just "Statement Date ", and it can be changed in the pre-batch if needed. Once you've pasted the pre-batch and main into a script method, the whole process is:
1. (Batch) OCR your documents. Set your program to save the files with the same names, but with .txt extension.
2. Load the TXT files into an instance of ARen in which you've loaded the script.
3. Make sure "Pair renaming" is turned on.
4. Load the PDFs after the text files are showing the changes. If you need to change anything in the script, do it and save the changes. If you need help, just ask.
EDIT: I said in the first draft you could load all the files together, but apparently the text files need to be processed first. You can still load everything at once: in Explorer, click the File Type header (probably twice, to get the text files to the top), then drag the files in. END EDIT
The only real caveat: if you accidentally load some other file, the whole thing may error out until you remove the offending file. If its extension is listed in the "Settings / JS Script editor / Allowed extensions for reading content from files:" box (and it really is a text file), you shouldn't get an error, but you'll probably get some weird endings on the file name.
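One way to soften that caveat is to check the extension before reading a file. This is only a minimal sketch in plain JavaScript: isOcrTextFile() and its "allowed" list are hypothetical names that are not in the script below, and it assumes the extension arrives with its leading dot, as the path + name + extension concatenation in the pre-batch suggests.

```javascript
// Hypothetical guard, not part of the posted script.
// "allowed" is an assumption: the extensions you expect to hold OCR text.
function isOcrTextFile( extension, allowed = [ ".txt" ] ) {
    // Compare in lower case so ".TXT" and ".txt" are treated alike.
    const ext = String( extension ).toLowerCase() ;
    return allowed.includes( ext ) ;
}
```

In the pre-batch loop this could be called with app.getItem(j).extension, storing an empty string in lineArr[j] and skipping the read when it returns false. I haven't tested that wiring, so treat it as a starting point.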
Oh, the script as it stands expects the date to look like "January 15 2025". It should also handle abbreviated English month names (Jan, Sep, Sept, etc.). Other languages would need the sub-function "f_testMonth()" changed to look for the correct words, and of course the date format adjusted. If there are commas or other punctuation in the date, the search pattern will (probably) need to be changed. Of course, with the proper massaging it could be made to look for almost any information in a text file.
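To show the kind of massaging involved, here is a minimal sketch of a more forgiving date parser. It is plain JavaScript and not part of the script below: toYyyymmdd() and MONTHS are hypothetical names, it assumes English month names followed by day and year, and it assumes your JavaScript engine supports String.prototype.padStart.

```javascript
// Hypothetical helper, not part of the posted script: turns the text that
// follows the locator phrase into a YYYYMMDD string, or "" if no date is found.
const MONTHS = [ "jan", "feb", "mar", "apr", "may", "jun",
                 "jul", "aug", "sep", "oct", "nov", "dec" ] ;

function toYyyymmdd( text ) {
    // Month word, optional period, day, optional comma, four-digit year.
    const m = text.match( /([A-Za-z]+)\.?\s+(\d{1,2}),?\s+(\d{4})/ ) ;
    if ( ! m ) return "" ;
    // Compare on the first three letters so "Sept" and "September" both match.
    const monthIndex = MONTHS.indexOf( m[1].slice( 0, 3 ).toLowerCase() ) ;
    if ( monthIndex === -1 ) return "" ;
    const month = String( monthIndex + 1 ).padStart( 2, "0" ) ;
    const day = m[2].padStart( 2, "0" ) ;
    return m[3] + month + day ;
}
```

Matching on the first three letters replaces the long switch with a lookup, at the cost of accepting any word that merely starts like a month name.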
Screenshot: https://drive.google.com/file/d/11_fEXLTVgVMdCO1z37HhszslCRTo6NJB/view?usp=sharing
The script:
//--------------------------------------------
// PRE-BATCH:
let fName = "", fileText = "", monthNum = "", lineText = "" ;
let yearLoc ;
let dateStrip ;
let test ; let fileSearchLoc ;
let lineArr = [] ;
const locatorStr = "Statement Date " ; // CHANGE THIS TO LOCATE OTHER TEXT STRINGS
const dateLocatorPattern = new RegExp( locatorStr, "i" ) ;
const locatorLength = locatorStr.length ;
lineArr = f_FindStatementDate() ;
function f_FindStatementDate() {
    // Parse each file, extract the information, put it in array lineArr:
    for ( let j = 0; j < app.itemCount; j++ ) {
        // Default to an empty string so every item has an entry:
        lineArr[j] = "" ;
        // Make file+path fullname:
        fName = app.getItem(j).path + app.getItem(j).name + app.getItem(j).extension ;
        // Read the entire file into the var fileText:
        fileText = app.readFileText( fName ) ;
        if ( ! fileText ) continue ;
        // Locate the data you want. search() returns -1 when the phrase is missing,
        // so test for that before adding the length of the locator string:
        fileSearchLoc = fileText.search( dateLocatorPattern ) ;
        if ( fileSearchLoc == -1 ) continue ;
        fileSearchLoc += locatorLength ;
        // Extract the info (this just brute-forces the next 25 characters after the
        // search phrase, enough to cover most date formats):
        fileText = fileText.substring( fileSearchLoc, fileSearchLoc + 25 ) ;
        // Find the start of the year in fileText (this assumes year is last in date):
        yearLoc = fileText.search( /\d{4}/ ) ;
        if ( yearLoc == -1 ) continue ;
        // Remove everything after the year:
        dateStrip = fileText.slice( 0, yearLoc + 4 ) ;
        // Find the month in text (in my bank statement, anyway):
        test = dateStrip.match( /[A-Za-z]+/ ) ;
        if ( ! test ) continue ;
        // Send the text month to function f_testMonth() to get the month number as a string:
        monthNum = f_testMonth( test[0] ) ;
        if ( ! monthNum ) continue ;
        // Rebuild the date as YYYYMMDD. The comma after the day is optional, and a
        // single-digit day is padded with a zero so the result is always 8 digits:
        dateStrip = dateStrip.replace( /(\w+)\s+(\d+),?\s+(\d{4})/, function( match, mon, day, year ) {
            return year + monthNum + ( day.length < 2 ? "0" + day : day ) ;
        } ) ;
        // Add the processed date to lineArr[]:
        lineArr[j] = dateStrip ;
    }
    return lineArr ;
}
// FUNCTION f_testMonth() - convert month name text to month number (string):
// This would need reworking for other languages/conventions
function f_testMonth( str ) {
    // Start with an empty string so an unrecognized word returns "" instead of undefined:
    let retString = "" ;
    switch( str ) {
        case "January" :
        case "Jan" :
            retString = "01" ;
            break ;
        case "February" :
        case "Feb" :
            retString = "02" ;
            break ;
        case "March" :
        case "Mar" :
            retString = "03" ;
            break ;
        case "April" :
        case "Apr" :
            retString = "04" ;
            break ;
        case "May" :
            retString = "05" ;
            break ;
        case "June" :
        case "Jun" :
            retString = "06" ;
            break ;
        case "July" :
        case "Jul" :
            retString = "07" ;
            break ;
        case "August" :
        case "Aug" :
            retString = "08" ;
            break ;
        case "September" :
        case "Sept" :
        case "Sep" :
            retString = "09" ;
            break ;
        case "October" :
        case "Oct" :
            retString = "10" ;
            break ;
        case "November" :
        case "Nov" :
            retString = "11" ;
            break ;
        case "December" :
        case "Dec" :
            retString = "12" ;
            break ;
    }
    return retString ;
}
//--------------------------------------------
//--------------------------------------------
// MAIN:
// BANK_STATEMENTtxt2pdf.js.aren
// 1. Make any changes needed in Pre-Batch, or here, like different
// search string. (save changes)
// 2. OCR your documents (my PDF editor, PDF Studio, has a "batch"
// function that allows me to OCR any number of PDFs at one
// time. It automatically names the text files with
// the base name of the .pdf - which is what is needed.)
// 3. Load this script, then load your pdf and txt files, making
// sure to load them so that the text files are processed first.
// 4. Make sure "Pair renaming" is checked in the menu.
// Everything is already done; all that's needed is to
// retrieve the correct element in lineArr[] and return it
// appended to the file name (with any other formatting
// needed, e.g., if you want dashes between YYYY-MM-DD):
const dateNew = lineArr[item.index] ;
return item.newBasename + "_" + dateNew ;
//--------------------------------------------
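If you do want the dashes mentioned in the main comments, something like the following could be wrapped around the date before it is returned. This is a minimal sketch: withDashes() is a hypothetical helper that is not in the script above, and it assumes the date is already an 8-digit YYYYMMDD string.

```javascript
// Hypothetical helper: "20250115" -> "2025-01-15".
// Anything that is not exactly 8 digits is returned unchanged.
function withDashes( yyyymmdd ) {
    const m = /^(\d{4})(\d{2})(\d{2})$/.exec( yyyymmdd ) ;
    return m ? m[1] + "-" + m[2] + "-" + m[3] : yyyymmdd ;
}
```

Returning the input unchanged when it doesn't match means an empty or odd entry in lineArr[] passes through as-is rather than causing an error.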
Let me know if I can help...
Best,
DF