Fully automated data curation

We provide a fully automated data curation service that enhances the quality of raw clinical, genetic, and lifestyle data. The service supports metadata extraction, outlier detection, data imputation, and de-duplication.

Flexible performance and scale

Conventional methods, such as the z-score and the interquartile range, as well as advanced methods, such as isolation forests and Gaussian elliptic envelopes, are used to detect anomalies within the raw data.
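
As an illustration, the sketch below applies all four detectors to one synthetic column, reading the elliptic method as scikit-learn's EllipticEnvelope (a robust Gaussian fit); the feature, sample size, and thresholds are illustrative, not the service's actual configuration.

```python
# Minimal sketch of the conventional and advanced anomaly detectors
# named above; all data and cutoffs here are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
values = rng.normal(loc=120.0, scale=10.0, size=(200, 1))  # e.g. systolic BP
values[:3] = [[210.0], [35.0], [190.0]]                    # injected outliers

# Conventional: z-score flags points far from the mean in std-dev units.
z = (values - values.mean()) / values.std()
z_outliers = np.abs(z) > 3

# Conventional: interquartile range (IQR) fences.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Advanced: isolation forest isolates anomalies with random splits.
iso_flags = IsolationForest(random_state=0).fit_predict(values) == -1

# Advanced: elliptic envelope fits a robust Gaussian and flags the tails.
ee_flags = EllipticEnvelope(contamination=0.02).fit_predict(values) == -1

print(z_outliers.sum(), iqr_outliers.sum(), iso_flags.sum(), ee_flags.sum())
```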

Advanced data imputers

Advanced data imputers replace missing values based on virtual patient profiles built for each real patient. Similarity-detection methods and lexical matchers identify variables with identical distributions and common meaning. Useful metadata are provided to the user, along with detailed reports in which inconsistent fields are highlighted with color coding.
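
A minimal sketch of this idea follows, with scikit-learn's KNNImputer standing in for the service's profile-based imputers: each missing value is filled from the most similar patients. The feature names and values are illustrative.

```python
# Sketch: nearest-neighbor imputation as a stand-in for the
# virtual-patient-profile imputers described above.
import numpy as np
from sklearn.impute import KNNImputer

# rows = patients, columns = [age, BMI, glucose]; NaN marks missing entries
X = np.array([
    [54.0, 27.1, 5.6],
    [61.0, np.nan, 6.1],
    [47.0, 24.3, np.nan],
    [58.0, 26.8, 5.9],
])

# Each gap is filled from the 2 most similar patients, weighted by distance.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_filled = imputer.fit_transform(X)
print(X_filled)
```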

Pezoulas VC, Kourou KD, Kalatzis F, Exarchos TP, Venetsanopoulou A, Zampeli E, Gandolfo S, Skopouli F, De Vita S, Tzioufas AG, Fotiadis DI. Medical data quality assessment: On the development of an automated framework for medical data curation. Comput Biol Med. 2019 Apr;107:270-283. doi: 10.1016/j.compbiomed.2019.03.001. Epub 2019 Mar 7. PMID: 30878889.

Why choose us

Hybrid data harmonization

We have long-standing experience in handling data and cohorts from different clinical centers and in performing data harmonization.

Heterogeneity of data

We offer a hybrid data harmonization service that uses advanced lexical and semantic analyzers to detect terminologies with a common conceptual and lexical basis, aiming to overcome structural heterogeneities across retrospective medical data from diverse medical centers. To this end, we use WebProtege to transform conventional data models into ontologies (hierarchical data models) and to define reference ontologies for different diseases in line with HL7 FHIR and the FAIR principles.
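
To make this step concrete: the service itself uses WebProtege, but the transformation of a flat data model into a small class hierarchy can be sketched by hand with rdflib. Every class name and the example.org namespace below are illustrative.

```python
# Sketch: lifting flat data-model fields into a tiny class hierarchy,
# a hand-rolled stand-in for the WebProtege ontology-building step.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/reference-ontology#")
g = Graph()
g.bind("ex", EX)

# Root concept of the reference ontology, with two variable groups.
g.add((EX.ClinicalVariable, RDF.type, RDFS.Class))
for group in (EX.LabTest, EX.Demographic):
    g.add((group, RDFS.subClassOf, EX.ClinicalVariable))

# Leaf terminologies lifted from a (hypothetical) flat data model.
for name, parent in [("Glucose", EX.LabTest), ("Age", EX.Demographic)]:
    term = EX[name]
    g.add((term, RDFS.subClassOf, parent))
    g.add((term, RDFS.label, Literal(name)))

print(g.serialize(format="turtle"))
```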

Reference ontologies

The reference ontologies are enriched with word embeddings and additional terminologies from Python's NLTK and external vocabularies (e.g., OHDSI Athena) to define a large corpus. The corpus is then used to identify terminologies with a common conceptual and/or lexical basis in the raw databases per domain of interest, and the values are standardized according to clinical practice.
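
As a rough illustration of the enrichment step, the sketch below expands a few illustrative ontology terms with WordNet synonyms via NLTK; the word embeddings and external vocabularies such as OHDSI Athena are omitted here.

```python
# Sketch: enriching ontology terms with WordNet synonyms from NLTK
# to grow the matching corpus; the terms themselves are illustrative.
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

ontology_terms = ["glucose", "age", "hypertension"]

corpus = {}
for term in ontology_terms:
    synonyms = {term}
    for synset in wordnet.synsets(term):
        synonyms.update(lemma.name().replace("_", " ") for lemma in synset.lemmas())
    corpus[term] = sorted(synonyms)

for term, names in corpus.items():
    print(term, "->", names)
```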

Our key differentiators

Easy to use

The data harmonization workflow is depicted in Fig. 1 and consists of four stages: (i) metadata extraction and relational modeling; (ii) construction of reference semantic data models (ontologies) for CVD and mental disorders; (iii) development of medical dictionaries by interlinking the word embeddings from the ontologies with external knowledge repositories, such as OHDSI (Observational Health Data Sciences and Informatics) [15]; and (iv) lexical and semantic analysis, built on top of the dictionaries to identify terminologies with a common lexical and conceptual basis. The output is a data harmonization report that includes the matching scores for each identified terminology, along with useful metadata.
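
Stage (iv) and the report can be sketched as follows, with difflib's SequenceMatcher standing in for the lexical analyzer and cosine similarity over toy vectors standing in for the embedding-based semantic analyzer; the field names, dictionary terms, and vectors are all illustrative.

```python
# Sketch: scoring raw field names against dictionary terms lexically
# and semantically, then collecting the scores into a small report.
from difflib import SequenceMatcher
import numpy as np

# Toy 2-d vectors stand in for the learned word embeddings.
dictionary = {"glucose": np.array([0.9, 0.1]), "age": np.array([0.1, 0.95])}
raw_fields = {"glu_mg_dl": np.array([0.85, 0.15]), "patient_age": np.array([0.12, 0.9])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

report = []
for field, vec in raw_fields.items():
    for term, tvec in dictionary.items():
        lexical = SequenceMatcher(None, field, term).ratio()
        semantic = cosine(vec, tvec)
        report.append((field, term, round(lexical, 2), round(semantic, 2)))

# Best combined matches first, as in a harmonization report.
for row in sorted(report, key=lambda r: -(r[2] + r[3])):
    print(row)
```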

V. C. Pezoulas et al., “A hybrid data harmonization workflow using word embeddings for the interlinking of heterogeneous cross-domain clinical data structures,” 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI), Athens, Greece, 2021, pp. 1-4, doi: 10.1109/BHI50953.2021.9508484.