Problem Statement :
PDF (Portable Document Format) is one of the most widely used document formats among organizations and individuals for exchanging information. Countless business processes rely on manipulating PDF files for invoices, legal documents, reports, and a variety of other documents. Time is critical in any enterprise, and it becomes a crucial parameter when data has to be extracted from a large number of PDF documents.
Automating data extraction from PDF files with Robotic Process Automation (RPA) tools is standard practice these days. However, it comes with various challenges in practical scenarios.
What if the wrong data is captured from a PDF? How do we identify it? Is a confidence score reliable enough for that? What if the input files come in multiple formats and the bot does not know which format it is processing at the moment? Can it detect the format by itself?
Let’s address one of these problems in this article.
Imagine a client who wants to extract information from PDF invoices. The client has multiple customers, each with a different invoice format, and we may not be able to ask each of those customers to adopt a uniform format just to ease the automation. Then what do we do?
Solution :
Let’s say the client receives invoices in a total of 10 formats, and the customers send their invoices in no particular order. When we execute the bot, it does not know which of the 10 formats the input file follows, and we’re not expecting the client’s customers to help us with that. What we can do is create and pre-save a histogram for each invoice template and, during execution, compute the Euclidean distance between the histogram of the input file and each saved template histogram. The template whose histogram has the minimum distance to the input file’s histogram is the one the input file matches most closely. This can all be done swiftly with a Python script.
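The matching step can be sketched in a few lines of Python. This is only a minimal illustration under some assumptions: each invoice page has already been rendered to 8-bit grayscale pixel values (the rendering step is not shown), and the histogram is a normalized intensity histogram. The function names `gray_histogram` and `closest_template`, the template names `vendor_a` and `vendor_b`, and the pixel lists are all hypothetical toy values.

```python
import math
from typing import Dict, Iterable, List, Tuple


def gray_histogram(pixels: Iterable[int], bins: int = 16) -> List[float]:
    """Normalized histogram of 8-bit grayscale pixel values (0-255)."""
    counts = [0] * bins
    total = 0
    for value in pixels:
        counts[value * bins // 256] += 1  # map 0-255 onto bin 0..bins-1
        total += 1
    if total == 0:
        raise ValueError("page has no pixels")
    # Normalize so that pages of different sizes remain comparable.
    return [c / total for c in counts]


def closest_template(
    page_hist: List[float], template_hists: Dict[str, List[float]]
) -> Tuple[str, float]:
    """Return (template name, Euclidean distance) of the nearest saved histogram."""
    distances = {
        name: math.dist(page_hist, hist) for name, hist in template_hists.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Toy data standing in for pre-saved template histograms (assumed values).
templates = {
    "vendor_a": gray_histogram([255] * 90 + [0] * 10),  # mostly white page
    "vendor_b": gray_histogram([255] * 60 + [0] * 40),  # heavier ink coverage
}

# Histogram of the incoming file, computed at execution time.
incoming = gray_histogram([255] * 88 + [0] * 12)
match, distance = closest_template(incoming, templates)
print(match, round(distance, 4))
```

In a real bot it may also be worth rejecting the match when even the smallest distance is above some threshold, so that an unseen format is not forced onto the nearest template.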
Now the RPA bot knows which template the input file follows and can directly apply that format’s pre-defined rules to extract the data from the input file.
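To make the "pre-defined rules per format" idea concrete, here is a rough sketch of how the detected template name could select a set of extraction rules. In an actual deployment the rules would live inside the RPA tool; the version below assumes the invoice text has already been extracted and uses regular expressions as stand-in rules. The `RULES` table, the field names, the patterns, and `extract_fields` are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical per-template rules: field name -> regex with one capture group.
RULES: Dict[str, Dict[str, str]] = {
    "vendor_a": {
        "invoice_no": r"Invoice No:\s*(\S+)",
        "total": r"Total:\s*([\d.,]+)",
    },
    "vendor_b": {
        "invoice_no": r"Ref\s*#\s*(\S+)",
        "total": r"Amount Due\s*([\d.,]+)",
    },
}


def extract_fields(template: str, text: str) -> Dict[str, Optional[str]]:
    """Apply the rules of the detected template; missing fields become None."""
    result: Dict[str, Optional[str]] = {}
    for field, pattern in RULES[template].items():
        found = re.search(pattern, text)
        result[field] = found.group(1) if found else None
    return result


sample_text = "Invoice No: INV-001\nTotal: 1,250.00"
fields = extract_fields("vendor_a", sample_text)
print(fields)
```

Keeping the rules in one table keyed by template name means supporting an eleventh format only requires a new saved histogram and a new entry in the table.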