Financial Statement Data Extraction

Our team has developed a web application that processes scanned PDF files, splits them into individual pages, and extracts pre-configured data structures from each page.

The problem

The main problem was extracting and collecting the required financial data. The Bulgarian Trade Register, the official source of this information, publishes Annual Financial Reports only as scanned documents, so we had to develop a solution that extracts the scanned data and converts it into a machine-readable format for our model.

The solution

The system accepts one or more documents, either submitted manually or retrieved automatically from the commercial register. The operator can track the whole process and make adjustments at each stage. The operator can also configure rules and automatic data checks, and the system keeps event logs. The extracted data consists of lists of accounting codes covering multiple years and companies, converted into a format suitable for processing by other systems, which allows integration with various financial software.
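The operator-configurable data checks mentioned above could be modelled roughly as follows. This is only a minimal sketch, not the production implementation: the `Rule` class, the `run_checks` helper, the statement keys (`assets`, `equity`, `liabilities`) and the tolerance value are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# One extracted statement: accounting code -> amount (hypothetical codes).
Statement = Dict[str, float]


@dataclass
class Rule:
    """A named check that an operator could enable, disable, or tune."""
    name: str
    check: Callable[[Statement], bool]


def run_checks(statement: Statement, rules: List[Rule]) -> List[str]:
    """Return the names of all rules the statement fails."""
    failed = []
    for rule in rules:
        try:
            passed = rule.check(statement)
        except KeyError:
            # A code the rule needs was not extracted: treat as a failure.
            passed = False
        if not passed:
            failed.append(rule.name)
    return failed


def balance_rule(tolerance: float = 1.0) -> Rule:
    """Assets must equal equity plus liabilities, within a rounding tolerance."""
    return Rule(
        name="assets = equity + liabilities",
        check=lambda s: abs(s["assets"] - s["equity"] - s["liabilities"]) <= tolerance,
    )


# Hypothetical rule set; in practice this would come from operator configuration.
RULES = [
    balance_rule(tolerance=1.0),
    Rule(name="assets are non-negative", check=lambda s: s["assets"] >= 0),
]
```

Calling `run_checks(statement, RULES)` returns an empty list when every check passes, so a failed rule name can be shown to the operator or written to the event log.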

Our team was responsible for

Identifying and gathering requirements
UI/UX
Implementing the web user interface
Integration with Machine Learning models

The client

Our client is one of the largest Bulgarian banks, for which we are building a predictive ML model for credit risk analysis. Our solution processes raw financial data in a specific format to generate a risk coefficient.

Frontend technologies

web UI – Angular

How we used ML

To extract the text from scanned documents, we used OCR (pytesseract). To process the raw data, we used machine learning tools and Python libraries: Large Language Models (LLMs) interpret and structure the financial data, while Pandas and NumPy handle data manipulation and transformation.
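As an illustration of the OCR and data-handling steps, here is a minimal sketch under stated assumptions. The row layout (label, five-digit row code, current-year amount, prior-year amount), the `ROW` pattern, and the helper names `ocr_page`, `parse_rows` and `to_frame` are hypothetical. The sketch also assumes that Tesseract is installed with the Bulgarian (`bul`) language data and that each page has already been rendered to an image file.

```python
import re
from typing import Dict, List, Union

# Assumed row layout: "<label> <5-digit code> <current year> <prior year>".
ROW = re.compile(
    r"^(?P<label>.+?)\s+(?P<code>\d{5})\s+(?P<current>-?\d+)\s+(?P<previous>-?\d+)\s*$"
)


def ocr_page(image_path: str, lang: str = "bul") -> str:
    """Run Tesseract on one page image and return the recognised text."""
    # Imported lazily so the parsing code below works without the OCR stack.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


def parse_rows(ocr_text: str) -> List[Dict[str, Union[str, int]]]:
    """Keep only lines that match the assumed row layout."""
    rows = []
    for line in ocr_text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # headings, signatures, OCR noise
        rows.append({
            "label": match["label"],
            "code": match["code"],
            "current": int(match["current"]),
            "previous": int(match["previous"]),
        })
    return rows


def to_frame(rows: List[Dict[str, Union[str, int]]]):
    """Load parsed rows into a pandas DataFrame for further transformation."""
    import pandas as pd

    return pd.DataFrame(rows, columns=["label", "code", "current", "previous"])
```

A page would then be processed as `to_frame(parse_rows(ocr_page("page_1.png")))`, where `page_1.png` is a placeholder file name. Lines that do not fit the pattern are skipped rather than guessed at, which leaves ambiguous rows for the LLM step or for operator review.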