Datlowe blog

Publishing our work.

2017-03-10

Reading documents with Aspose

One of the biggest obstacles can be the first step in our analysis pipeline: preparing the data for analysis. Our customers' data is stored in many document formats, e.g. machine-readable PDFs, scans, and MS Office documents such as Word and Excel files. There are two ways to deploy our pipeline: we can run it in the cloud or in the customer's own environment. Setting up the pipeline should be as simple as possible and should not incur any additional license costs for our customers.

We had a document transformation solution implemented in .NET on Windows machines, with dependencies on MS Outlook and MS Office products. To automate this very important step in the pipeline and make it scalable, we wanted to rebuild it as an asynchronous, highly scalable microservice without heavy dependencies.

Solution
The Aspose SDK gave us a Java API that runs without any dependency on a Windows environment. We are now able to build lightweight text extraction and file conversion microservices, packaged as Docker images. These microservices can run in a highly parallel and scalable way in our Apache Mesos environment. It is also very simple to set up a Docker environment and deploy the microservices in our customers' environments.
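To make the idea of an asynchronous extraction service more concrete, here is a minimal Java sketch of how such a service could dispatch documents by file extension and extract text on a thread pool. It is an illustration under stated assumptions, not our production code: the names ExtractionSketch, TextExtractor and extractAsync are hypothetical, and the Aspose calls are deliberately left out. In a real service, each TextExtractor would wrap the corresponding library call for its format.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: all names here are illustrative assumptions.
final class ExtractionSketch {

    /** Hypothetical abstraction: turns the raw bytes of one document into plain text. */
    interface TextExtractor {
        String extract(byte[] content) throws Exception;
    }

    private final Map<String, TextExtractor> extractorsByExtension;
    private final ExecutorService pool;

    ExtractionSketch(Map<String, TextExtractor> extractorsByExtension, int threads) {
        // Defensive copy so later changes to the caller's map do not affect the service.
        this.extractorsByExtension = new HashMap<>(extractorsByExtension);
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Returns the lower-cased extension without the dot, or "" if there is none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Picks an extractor by file extension and runs it on the pool. */
    CompletableFuture<String> extractAsync(String fileName, byte[] content) {
        TextExtractor extractor = extractorsByExtension.get(extensionOf(fileName));
        if (extractor == null) {
            // Unknown formats fail immediately instead of occupying a worker thread.
            CompletableFuture<String> failed = new CompletableFuture<>();
            failed.completeExceptionally(
                    new IllegalArgumentException("Unsupported format: " + fileName));
            return failed;
        }
        return CompletableFuture.supplyAsync(() -> {
            try {
                return extractor.extract(content);
            } catch (Exception e) {
                throw new CompletionException(e);
            }
        }, pool);
    }

    /** Stops accepting new work; already submitted tasks still finish. */
    void shutdown() {
        pool.shutdown();
    }

    public static void main(String[] args) {
        // Stand-in extractor for plain text; a real one would call a document library.
        TextExtractor plain = bytes -> new String(bytes, StandardCharsets.UTF_8);
        ExtractionSketch service =
                new ExtractionSketch(Collections.singletonMap("txt", plain), 4);
        try {
            byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
            System.out.println(service.extractAsync("note.TXT", content).join());
        } finally {
            service.shutdown();
        }
    }
}
```

Keeping the format-specific logic behind a small interface means a new format can be supported by registering one more extractor, and the same class can run unchanged inside a container in the cloud or at a customer site.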

Experience
In the past we implemented a basic office document converter ourselves and tried several other free and paid solutions, but we found Aspose.Total for Java to be the clear winner in this case. It has a lot of functionality, is easy to use, has clear documentation, and has no dependency on Windows or on the native Office .NET API. It is easily scalable and embeddable, and has a convenient license policy. We also use it to extract text, together with metadata, from machine-readable PDF and Word files.

Next Steps
We plan to experiment with the OCR library and try to collaborate with Aspose on support for East European languages. In the coming months we are also going to implement document generation, and we look forward to using Aspose for this scenario too.

Summary
We see the Aspose library as the key driver of several core components in our document analysis pipeline. It is easy to recommend Aspose to anyone who needs to work with Microsoft Office and PDF documents in scenarios ranging from text extraction to document conversion.
