Project Description

Our team has developed a web application that processes scanned PDF files, separating each page and extracting pre-configured data structures from it.
The problem
The main challenge was extracting and collecting the required financial data. The Bulgarian Trade Register, the official source of this information, offers Annual Financial Reports only as scanned documents, so we had to develop a solution that extracts the scanned data and converts it into a machine-readable format for our model.
The solution
The system works with one or more documents, either submitted manually or retrieved automatically from the commercial register. The operator can track the whole process and make adjustments at each stage. In addition, the system offers operator-configurable rules and automatic data checks, as well as event logs. The extracted data consists of lists of accounting codes covering multiple years and companies, converted into a format suitable for processing by other systems, which allows integration with various financial software.
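
As an illustration of what operator-configurable rules and automatic data checks could look like, here is a minimal Python sketch. It is not the production implementation: the `Rule` class, the `run_checks` function, the accounting codes "A", "E" and "L", and the rounding tolerance are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# One extracted report: accounting code -> reported value.
Report = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    """A named check that an operator could enable or disable (hypothetical)."""
    name: str
    check: Callable[[Report], bool]


def run_checks(report: Report, rules: List[Rule]) -> List[str]:
    """Return the names of all rules that the report fails."""
    failed = []
    for rule in rules:
        try:
            ok = rule.check(report)
        except KeyError:
            # A missing accounting code counts as a failed check, not a crash.
            ok = False
        if not ok:
            failed.append(rule.name)
    return failed


# Hypothetical codes: "A" = total assets, "E" = equity, "L" = liabilities.
RULES = [
    # Assets should equal equity plus liabilities, within a rounding tolerance.
    Rule("balance_sheet_balances", lambda r: abs(r["A"] - (r["E"] + r["L"])) < 0.5),
    Rule("assets_non_negative", lambda r: r["A"] >= 0),
]

good = {"A": 1500.0, "E": 900.0, "L": 600.0}
bad = {"A": 1500.0, "E": 900.0, "L": 500.0}

print(run_checks(good, RULES))  # []
print(run_checks(bad, RULES))   # ['balance_sheet_balances']
```

Keeping each rule as a small named object means the failed rule names can be shown to the operator or written to an event log without changing the checking loop.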

Our team was responsible for:
- Identifying and gathering requirements
- UI/UX
- Implementing the web user interface
- Integration with Machine Learning models

The client
Our client is one of the largest Bulgarian banks, where we are working on a project for a predictive ML model for Credit Risk Analysis. Our solution processes raw financial data in a specific format to generate a risk coefficient.
Frontend technologies
Web UI – Angular
How we used ML
To extract the data from scanned documents, we use OCR (pytesseract). To process the raw data effectively, we use machine learning tools and Python libraries. Large Language Models (LLMs) interpret and structure the financial data. For efficient data handling and transformation, we use Pandas and NumPy.
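
The OCR and structuring steps can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's actual pipeline: the line layout (five-digit code, label, current-year value, previous-year value), the `ocr_page` and `parse_rows` helpers, and the sample text are hypothetical. The only pytesseract call used is `image_to_string`, and `lang="bul"` assumes the Bulgarian Tesseract language data is installed.

```python
import re
from typing import Dict, List, Union

Row = Dict[str, Union[str, int]]

# Assumed layout of one OCR line: "<5-digit code> <label> <current> <previous>".
ROW_PATTERN = re.compile(r"^\s*(\d{5})\s+(.+?)\s+(-?\d+)\s+(-?\d+)\s*$")


def ocr_page(image_path: str) -> str:
    """Run OCR on one scanned page (requires Tesseract and Pillow)."""
    # Imported lazily so the parsing part below runs without Tesseract installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang="bul")


def parse_rows(ocr_text: str) -> List[Row]:
    """Keep only lines that match the assumed layout; skip headers and noise."""
    rows: List[Row] = []
    for line in ocr_text.splitlines():
        match = ROW_PATTERN.match(line)
        if match is None:
            continue
        code, label, current, previous = match.groups()
        rows.append({
            "code": code,
            "label": label,
            "current": int(current),
            "previous": int(previous),
        })
    return rows


# Hypothetical OCR output, including a header and one unreadable line.
SAMPLE_TEXT = """
BALANCE SHEET (thousand BGN)
02100 Tangible assets 1234 1100
02200 Intangible assets 56 60
total ??? 12x4
04000 Receivables 310 -5
"""

for row in parse_rows(SAMPLE_TEXT):
    print(row)
```

A list of dictionaries like this can be passed directly to `pandas.DataFrame(rows)` for further transformation. Lines that the pattern cannot parse are dropped here, whereas a real pipeline would more likely flag them for operator review or pass them to an LLM for interpretation.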
Backend technologies
- Java
- Python
- ML Models
