enRichMyData Toolbox

enRichMyData will deliver its capabilities as a set of interoperable tools and services that will form the enRichMyData Toolbox. The figure above provides the conceptual architecture of the toolbox. At the center lies the notion of a data enrichment pipeline that receives input data to be enriched and data to enrich with (left-hand side), and generates enriched data (right-hand side). The enrichment process is supported by:

  • a set of tools (top of the figure) that provide functional capabilities needed to support the design of pipelines; and
  • a set of infrastructure services (bottom of the figure) that provide non-functional capabilities needed to support the effective and efficient deployment and execution of pipelines.

As a toolbox of loosely coupled but interoperable tools and services, enRichMyData is meant to handle complex data enrichment scenarios, where tools and services can be combined and customized as needed. Each main tool or infrastructure service is supported by at least one software component. Below we give a short description of the available components for each tool and infrastructure service category, including references to any GitHub repositories.



    DiscoverR

    DiscoverR assists users in searching datasets, ontologies, and enrichment services and provides insights on their content to support their use in enrichment pipelines. The user can search for keywords over descriptions of catalogued datasets/ontologies/services (particular features such as metadata, formats, ontology terms, quality features), or browse specific descriptions from a visual interface. Catalogued datasets/services/ontologies encompass well-established knowledge bases (e.g., Wikidata, DBpedia, Schema.org) and data used in and created with pipelines. DiscoverR provides semantic data profiling techniques to enrich basic descriptions based on metadata (e.g., DCAT) with ontology usage patterns (e.g., connections between concepts) and statistics (e.g., frequency, cardinality). These profiling techniques are applied to semantic sources, as well as to "non-semantic" sources, which are profiled by inferring the semantics of their schema using the annotation services of the LinkR component. Profiling is compliant with and boosts FAIR principles.

    ABSTAT

    Scalable data profiling tool for RDF data (knowledge graphs) based on: 1) pattern extraction (class - predicate - class), possibly with the support of the data ontology; 2) calculation of different statistics. Profiles can be accessed through: 1) full-text search over patterns; 2) browsing; 3) APIs.
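The pattern extraction idea can be illustrated with a minimal Python sketch. It is not ABSTAT's implementation: it assumes every entity has exactly one declared type, ignores the ontology, and uses made-up identifiers (`ex:...`) and a hypothetical helper name (`extract_patterns`). It only shows how instance-level triples can be lifted to class - predicate - class patterns with a frequency statistic.

```python
from collections import Counter

RDF_TYPE = "rdf:type"

# Toy (subject, predicate, object) triples; all identifiers are invented for illustration.
TRIPLES = [
    ("ex:rome", RDF_TYPE, "ex:City"),
    ("ex:milan", RDF_TYPE, "ex:City"),
    ("ex:italy", RDF_TYPE, "ex:Country"),
    ("ex:rome", "ex:locatedIn", "ex:italy"),
    ("ex:milan", "ex:locatedIn", "ex:italy"),
    ("ex:italy", "ex:capital", "ex:rome"),
]


def extract_patterns(triples):
    """Count (subject class, predicate, object class) patterns."""
    # First pass: record the declared type of each entity (assumption: one type each).
    types = {s: o for s, p, o in triples if p == RDF_TYPE}
    patterns = Counter()
    # Second pass: lift every non-typing triple to the class level.
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        # Untyped subjects or objects fall back to a generic class.
        pattern = (types.get(s, "owl:Thing"), p, types.get(o, "owl:Thing"))
        patterns[pattern] += 1
    return patterns


patterns = extract_patterns(TRIPLES)
for (subject_class, predicate, object_class), frequency in patterns.most_common():
    print(subject_class, predicate, object_class, frequency)
```

A real profiler would also have to handle entities with several types, literals, and subclass relations from the ontology, which this sketch leaves out.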

    Home GitHub License TRL

    SemTUI

    Table reconciliation and extension service: Semantic annotation of tabular data by external services. Entity reconciliation and linking. Table extension extracting data from external datasets and Knowledge Graphs.


    Home Document GitHub License TRL

    WrappR

    WrappR provides data access using a virtual semantic layer and ensures secure access. WrappR is delivered as a semantic graph database with efficient reasoning, cluster and external index synchronization support. It provides a variety of API types and access methods as well as different types of data federation and virtualization. Through semantic data access and integration, WrappR provides a practical, robust and versatile tool to improve access to data.

    Ontotext GraphDB

    Ontotext GraphDB is a highly efficient and robust graph database with RDF and SPARQL support. It supports a number of plugins and connectors, such as the MongoDB connector for JSON store access, JDBC for exposing RDF as a virtual relational database, and Ontop for virtual SPARQL access.

    Home Document License TRL

    Ontotext Semantic Objects

    Semantic Objects is a declaratively configurable service for querying and mutating knowledge graphs. It automatically transpiles GraphQL queries and mutations into optimized SPARQL queries.


    Home Document License TRL

    Ontotext Semantic Search

    The Semantic Search provides a way to index the data from GraphDB in Elasticsearch and run queries against it.




    Home Document License TRL

    CleanR

    CleanR supports the specification of data manipulation transformations, including data cleaning operations and the generation of knowledge graphs from various data formats. Users specify transformations interactively from a user interface, and the specifications are stored in a machine-readable format so that they can be replicated and reused. CleanR provides a broad set of AI-enabled data transformations (e.g., ML-based recommendations) and integrates them with generic linking and extension functionalities provided by the ResourcR. CleanR enables data cleaning and enrichment operations to be shared (as an asset, text or executable), managed and, if needed, incorporated as steps in the data pipelines in the ScalR component.

    Ontotext Refine

    Ontotext Refine (OntoRefine) is a free application for automating the conversion of messy string data into a knowledge graph.



    Home Document License TRL

    RMLMapper

    RMLMapper executes RDF Mapping Language (RML) rules (https://rml.io/specs/rml/) to generate Linked Data from multiple originally (semi-)structured data sources.



    Home Document GitHub License TRL

    LinkR

    LinkR provides capabilities for semantic annotation of structured and semi-structured data using reference knowledge graphs and category schemes. Annotations consist of links from elements of the input data to elements of well-established knowledge bases and ontologies (e.g., Wikidata, DBpedia, and Geonames), or of user-defined knowledge graphs made available through the ResourcR. They include schema-level annotations with ontology terms and instance-level annotations with identifiers. LinkR supports annotation through ML algorithms that recommend annotations and a human-in-the-loop approach that enables fine-tuning of the recommendation algorithms and revision of the results, ensuring high-quality annotations while minimizing the users' effort even on very large data volumes. Annotations will be converted into data transformations, to be used as part of enrichment pipelines.

    SemTUI

    Table reconciliation and extension services: semantic annotation of tabular data using external services. The UI supports entity linking and schema annotations to support full-fledged mapping, and the specification of data extension operations using external datasets and Knowledge Graphs.

    Home Document GitHub License TRL

    selBat

    Table interpretation service: semantic annotation of tabular data by an unsupervised approach based on heuristics. It covers schema types and properties, entity reconciliation and linking. Target Knowledge Graph: Wikidata.


    Home Document GitHub License TRL

    Ontotext Refine

    Ontotext Refine (OntoRefine) is a free application for automating the conversion of messy string data into a knowledge graph. It allows reconciliation against any endpoint supporting the reconcile API protocol.


    Home Document License TRL

    Ontotext Reconciliation

    Ontotext reconciliation generates a reconciliation API endpoint on top of an RDF knowledge graph.
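As an illustration of how a client talks to a reconciliation endpoint, the Python sketch below builds a batch request in the style of the Reconciliation Service API used by OpenRefine-compatible tools (a form-encoded `queries` parameter holding a JSON object of keyed queries) and picks the best candidate from a response. The type `ex:City`, the entity identifiers, the helper names and the sample response are invented for illustration; no request is sent, and the exact fields supported by a given endpoint should be checked against its documentation.

```python
import json
from urllib.parse import urlencode


def build_reconciliation_form(names, type_id=None, limit=3):
    """Encode a batch of names as the form-encoded 'queries' parameter."""
    queries = {}
    for i, name in enumerate(names):
        query = {"query": name, "limit": limit}
        if type_id is not None:
            # Optionally restrict candidates to a given type.
            query["type"] = type_id
        queries[f"q{i}"] = query
    return urlencode({"queries": json.dumps(queries)})


def best_matches(response):
    """Return the top-scoring candidate id per query key, or None if there are no candidates."""
    best = {}
    for key, payload in response.items():
        candidates = payload.get("result", [])
        top = max(candidates, key=lambda c: c["score"], default=None)
        best[key] = top["id"] if top else None
    return best


# Request body that would be POSTed to a reconciliation endpoint (hypothetical type id).
form_body = build_reconciliation_form(["Milan", "Oslo"], type_id="ex:City")

# Hand-written response in the general shape such a service returns (invented data).
sample_response = {
    "q0": {"result": [
        {"id": "ex:milan", "name": "Milan", "score": 98.0, "match": True},
        {"id": "ex:milano_tx", "name": "Milano, Texas", "score": 61.5, "match": False},
    ]},
    "q1": {"result": []},
}

print(best_matches(sample_response))
```

Keeping unmatched queries as `None` rather than dropping them makes it easy to hand those rows to a human reviewer.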




    Home License TRL

    StructR

    StructR is the counterpart of LinkR for unstructured data. It generates structured data from the unstructured input text through semantic annotation, linking and extension. The text is processed by linguistic and semantic tools and concept mentions are identified and disambiguated from context. Furthermore, the text is projected into an embedding space by using representation learning. StructR supports a range of different pre-computed embeddings to represent the text and expand the dataset. Extension with custom annotation services is supported through a labeling interface for creating and editing text annotations which can then be used to build new annotation models in a human-in-the-loop fashion.

    Wikifier

    The JSI Wikifier is a web service that takes a text document as input and annotates it with links to relevant Wikipedia concepts (entities).



    Home Document License TRL

    Expert AI Platform Document Analyser

    With the Natural Language API's document analysis capabilities, you can perform deep linguistic analysis, keyphrase extraction, named entity recognition, relation extraction and sentiment analysis.


    Home Document License TRL

    Event Registry Relation Classifier

    The Event Registry Relation Classifier.





    Home Document License TRL

    ClassifiR

    ClassifiR supports data classification as a service and complements StructR. Whereas StructR identifies properties of parts of the text, ClassifiR labels the documents as a whole. The labels can be part of standard taxonomies, industry classifications, as well as custom sets of labels for which a classifier is built. Custom classification is supported by an interactive graphical interface which allows users to explore a document corpus and create ontologies through clustering, labelling and querying. ClassifiR automates the classification process and exposes it through a common endpoint independent of the classification used.

    InfoMiner

    A (semi-)automatic data exploration and topic ontology creation tool.




    Home Document GitHub License TRL

    Expert AI Platform Document Classification

    Document classification determines what a text is about in terms of categories of a taxonomy. Available taxonomies are: IPTC Media Topics, GeoTax, Emotional Traits, Behavioral Traits.


    Home Document License TRL

    ResourcR

    ResourcR provides infrastructure components to support the creation of linking services for a given dataset from a data provider as well as access mechanisms such as search and query. ResourcR enables performant linking and search functionalities with limited effort and exposes them as search and linking APIs. The combination of ResourcR and LinkR makes it possible to turn semantic data produced with the toolbox into resources immediately available for reuse.

    LamAPI

    Entity search and candidate retrieval. Retrieval of properties associated with entities. Target Knowledge Graphs: Wikidata, DBpedia.



    Home Document GitHub License TRL

    ScalR

    ScalR provides infrastructure components for executing cleaning, transformation and linking at large scale. ScalR provides horizontal scalability of data enrichment pipelines using software containers, and support for flexibly managing the different procedures associated with the execution of data enrichment pipelines on heterogeneous computing infrastructures. ScalR provides integrated support for specific data enrichment operations in the form of a pipeline through the development of reusable standard templates for setting up such pipelines. ScalR promotes the reuse and modification of existing data enrichment pipelines by exposing them as an integrated deployable unit, as opposed to ad-hoc, non-reusable pieces of code.

    TAO

    TAO (Tool Augmentation by user enhancements and Orchestration) is an open source, lightweight, generic, extensible and distributed orchestration framework. It allows commonly used toolboxes to be reused (i.e., integrated).


    Home Document GitHub License TRL

    ReusR

    ReusR provides infrastructure components for search and recommendation of assets (e.g., datasets, transformations) related to setting up and running data enrichment pipelines. It provides user login to data management assets, public or private access to assets, and support for editing, sharing and versioning them. ReusR enables users of the enRichMyData Toolbox to edit pipelines and promotes their reuse across use cases.

    TAO

    TAO (Tool Augmentation by user enhancements and Orchestration) is an open source, lightweight, generic, extensible and distributed orchestration framework. It allows commonly used toolboxes to be reused (i.e., integrated).


    Home Document GitHub License TRL

    StreamR

    StreamR provides infrastructure components for streaming support in data enrichment pipelines. It pipes data streams from/to appropriate endpoints and ensures high throughput, providing a configurable set of tools for setting up custom streams for new applications.

    Event Registry

    Event Registry provides a UI and an API for media monitoring and media intelligence based on AI algorithms that continuously analyse and group news articles.



    Home Document GitHub License TRL

    StreamStory

    StreamStory is a tool and service that ingests sensor data and other time series. It automatically fuses them and organizes them into a hierarchy of states and transitions between states.
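To make the notion of states and transitions concrete, here is a heavily simplified Python sketch. It is not StreamStory's algorithm: the states come from two fixed, made-up thresholds rather than from learned structure, only a single flat level is built instead of a hierarchy, and the function names and readings are hypothetical.

```python
from collections import Counter, defaultdict


def to_state(value, low=20.0, high=30.0):
    """Map a numeric reading to a coarse state using assumed thresholds."""
    if value < low:
        return "low"
    if value < high:
        return "medium"
    return "high"


def transition_probabilities(values):
    """Estimate state-to-state transition probabilities from consecutive readings."""
    states = [to_state(v) for v in values]
    counts = defaultdict(Counter)
    # Count how often each state is followed by each other state.
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    # Normalize the counts of each state into probabilities.
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


# Invented sensor readings (e.g., temperatures) for illustration only.
readings = [18.0, 19.5, 22.0, 25.0, 31.0, 33.0, 28.0, 19.0]
probabilities = transition_probabilities(readings)
for state, followers in probabilities.items():
    print(state, followers)
```

A hierarchy could be approximated by repeating the same computation with coarser or finer state definitions, but that step is left out here.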


    Home Document GitHub License TRL

    GreenR

    GreenR provides infrastructure components to support monitoring of data enrichment pipelines in terms of their environmental impact. It monitors the carbon footprint of the various components in the pipeline and presents the results to the user through a dashboard, so that the environmental impact of the heavy computations within the pipelines can be logged and modulated.

    Carbontracker

    Carbontracker is a tool for tracking and predicting the energy consumption and carbon footprint of training deep learning models.
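The arithmetic behind this kind of tracking and prediction can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not Carbontracker's API or its measurement method: the function names are hypothetical, the power draw, PUE (power usage effectiveness) and carbon intensity values are invented, and the prediction is a plain linear extrapolation from one measured epoch.

```python
def energy_kwh(avg_power_watts, duration_seconds, pue=1.0):
    """Energy in kWh from average power draw and duration, scaled by data-centre PUE."""
    return avg_power_watts * duration_seconds / 3600 / 1000 * pue


def co2_grams(kwh, intensity_g_per_kwh):
    """Emissions in grams of CO2-equivalent for a given grid carbon intensity."""
    return kwh * intensity_g_per_kwh


def extrapolate(measured_value, measured_epochs, total_epochs):
    """Linearly scale a measurement over a few epochs to the full training run."""
    return measured_value * total_epochs / measured_epochs


# Assumed example values: 250 W average draw, a 12-minute first epoch, PUE of 1.5.
first_epoch_kwh = energy_kwh(250.0, 720.0, pue=1.5)

# Predict the full 40-epoch run from the single measured epoch.
predicted_total_kwh = extrapolate(first_epoch_kwh, measured_epochs=1, total_epochs=40)

# Assumed grid carbon intensity of 300 gCO2eq/kWh.
predicted_co2_g = co2_grams(predicted_total_kwh, intensity_g_per_kwh=300.0)

print(f"first epoch: {first_epoch_kwh:.3f} kWh")
print(f"predicted total: {predicted_total_kwh:.2f} kWh, {predicted_co2_g:.0f} gCO2eq")
```

In practice the average power would come from hardware counters and the carbon intensity from a regional data source, both of which vary over time.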



    Home Document GitHub License TRL