Datlowe blog

Publishing our work.


Reading documents with Aspose

One of the biggest obstacles can be the first step in our analysis pipeline: preparing data for analysis. Our customers' data is stored in different document formats, e.g. machine-readable PDFs, scans, and MS Office documents such as Word and Excel files. There are two scenarios for deploying our pipeline: we can run it in the cloud or in the customer's environment. Setting up the pipeline should be as simple as possible and should not incur any additional license costs for our customers.

We had a solution for document transformation implemented in .NET on Windows machines, with dependencies on MS Outlook and MS Office products. To automate this very important step in the pipeline, we wanted to rebuild it as an asynchronous, highly scalable microservice without heavy dependencies.

The Aspose SDK provides a Java API that we can run without depending on a Windows environment. We are now able to build lightweight text extraction and file conversion microservices, packaged as Docker images. These microservices can run in a highly parallel and scalable way in our Apache Mesos environment. It is also very simple to set up a Docker environment and deploy the microservices in our customers' environments.
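To illustrate the shape of such a service, here is a minimal sketch of an asynchronous extraction component in plain Java. It is not our production code: `TextExtractor` and `ExtractionService` are hypothetical names, and the sketch assumes that files are routed by extension to a format-specific extractor, with the actual Aspose calls living inside the `TextExtractor` implementations.

```java
import java.nio.file.Path;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical abstraction: one implementation per document family (PDF, Word, ...). */
interface TextExtractor {
    String extract(Path file) throws Exception;
}

/** Hypothetical service that picks an extractor by file extension and runs it on a worker pool. */
final class ExtractionService implements AutoCloseable {
    private final Map<String, TextExtractor> extractorsByExtension;
    private final ExecutorService pool;

    ExtractionService(Map<String, TextExtractor> extractorsByExtension, int workers) {
        this.extractorsByExtension = Map.copyOf(extractorsByExtension);
        this.pool = Executors.newFixedThreadPool(workers);
    }

    /** Returns the lower-cased extension without the dot, or "" if there is none. */
    static String extensionOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Submits the extraction to the pool; unsupported formats fail the future immediately. */
    CompletableFuture<String> extractAsync(Path file) {
        TextExtractor extractor = extractorsByExtension.get(extensionOf(file));
        if (extractor == null) {
            return CompletableFuture.failedFuture(
                new IllegalArgumentException("Unsupported format: " + file));
        }
        return CompletableFuture.supplyAsync(() -> {
            try {
                return extractor.extract(file);
            } catch (Exception e) {
                // Wrap checked exceptions so they surface through the future.
                throw new CompletionException(e);
            }
        }, pool);
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}
```

Keeping the extractor behind a small interface means the service itself has no dependency on any particular library, so the same container image layout works both in the cloud and in a customer's environment.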

In the past, we implemented basic office document conversion ourselves and tried several other free and paid solutions, but we found Aspose.Total for Java to be the clear winner in this case. It offers a lot of functionality, is easy to use, has clear documentation, and has no dependency on Windows or the native Office .NET API. It is easily scalable and embeddable, and has a convenient license policy. We also use it to extract text, together with metadata, from machine-readable PDF and Word files.

Next Steps
We plan to experiment with the OCR library and to try to collaborate with Aspose on support for East European languages. In the coming months, we are also going to implement document generation, and we look forward to using Aspose for this scenario too.

We see the Aspose library as a key driver of several core components of our document analysis pipeline. It is easy to recommend Aspose to anyone who needs to work with Microsoft Office and PDF documents in scenarios ranging from text extraction to document conversion.
