enRichMyData will deliver its capabilities as a set of interoperable tools and services that will form the enRichMyData Toolbox. The figure above provides the conceptual architecture of the toolbox. At the center lies the notion of a data enrichment pipeline that receives input data to be enriched and data to enrich with (left-hand side), and generates enriched data (right-hand side). The enrichment process is supported by the tools and infrastructure services described below.
enRichMyData, being a toolbox of loosely coupled but interoperable tools and services, is meant to handle complex data enrichment scenarios in which tools and services can be combined and customized as needed. Each main tool or infrastructure service is supported by at least one software component. Below we give a short description of the available components for each tool and infrastructure service category, including references to any GitHub repositories.
DiscoverR
DiscoverR assists users in searching datasets, ontologies, and enrichment services and provides insights on their content to support their use in enrichment pipelines. The user can search for keywords over descriptions of catalogued datasets/ontologies/services (particular features such as metadata, formats, ontology terms, quality features), or browse specific descriptions from a visual interface. Catalogued datasets/services/ontologies encompass well-established knowledge bases (e.g., WikiData, DBpedia, Schema.org) and data used in and created with pipelines. DiscoverR provides semantic data profiling techniques to enrich basic descriptions based on metadata (e.g., DCAT) with ontology usage patterns (e.g., connections between concepts) and statistics (e.g., frequency and cardinality). These profiling techniques are applied to semantic sources as well as to "non-semantic" sources, which are profiled by inferring the semantics of their schema using the annotation services of the LinkR component. Profiling complies with and promotes the FAIR principles.
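As an illustration of the kind of statistics mentioned above, the following minimal sketch computes per-predicate frequency and the maximum number of objects per subject (one simple notion of cardinality) over a handful of triples. It is not DiscoverR's implementation: the `profile` function, the `ex:` identifiers, and the sample data are all assumptions made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical sample triples (subject, predicate, object); not from a real catalogue.
TRIPLES = [
    ("ex:Oslo", "rdf:type", "ex:City"),
    ("ex:Oslo", "ex:locatedIn", "ex:Norway"),
    ("ex:Milan", "rdf:type", "ex:City"),
    ("ex:Milan", "ex:locatedIn", "ex:Italy"),
    ("ex:Milan", "ex:twinnedWith", "ex:Lyon"),
    ("ex:Milan", "ex:twinnedWith", "ex:Frankfurt"),
]


def profile(triples):
    """Return (frequency per predicate, max objects per subject for each predicate)."""
    frequency = Counter(p for _, p, _ in triples)

    # Collect the distinct objects attached to each (subject, predicate) pair.
    objects = defaultdict(set)
    for s, p, o in triples:
        objects[(s, p)].add(o)

    # For each predicate, keep the largest object count seen for any subject.
    max_cardinality = defaultdict(int)
    for (_, p), objs in objects.items():
        max_cardinality[p] = max(max_cardinality[p], len(objs))

    return frequency, dict(max_cardinality)


frequency, max_cardinality = profile(TRIPLES)
for predicate in sorted(frequency):
    print(predicate, frequency[predicate], max_cardinality[predicate])
```

A profile of this shape could be attached to a dataset description alongside its basic metadata; real profiling over large knowledge graphs would need a streaming or query-based approach rather than in-memory lists.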
WrappR
WrappR provides secure data access through a virtual semantic layer. WrappR is delivered as a semantic graph database with efficient reasoning, cluster support, and external index synchronization. It offers a variety of APIs and access methods, as well as different types of data federation and virtualization. Through semantic data access and integration, WrappR provides a practical, robust and versatile tool to improve access to data.
CleanR
CleanR supports the specification of data transformations, including data cleaning operations and the generation of knowledge graphs from various data formats. Users specify transformations interactively from a user interface, and the specifications are stored in a machine-readable format so that they can be replicated and reused. CleanR provides a broad set of AI-enabled data transformations (e.g., ML-based recommendations) and integrates them with the generic linking and extension functionalities provided by ResourcR. CleanR enables data cleaning and enrichment operations to be shared (as an asset, text or executable), managed and, if needed, incorporated as steps in the data pipelines of the ScalR component.
RMLMapper executes RDF Mapping Language (RML) rules (https://rml.io/specs/rml/) to generate Linked Data from multiple (semi-)structured data sources.
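To convey the idea of rule-driven generation of Linked Data, the sketch below applies a small declarative mapping to CSV rows and emits N-Triples lines. It is a simplified illustration only: it does not use RMLMapper or the RML vocabulary, and the `MAPPING` structure, the `apply_mapping` function, the `example.org` IRIs, and the sample data are hypothetical.

```python
import csv
import io

# Hypothetical input; a real source could be a CSV, JSON or XML file.
CSV_TEXT = "id,name,country\n1,Oslo,Norway\n2,Milan,Italy\n"

# Hypothetical mapping: a subject IRI template plus (predicate IRI, source column) pairs.
# Real RML rules are written in RDF and are considerably richer than this.
MAPPING = {
    "subject_template": "http://example.org/city/{id}",
    "predicate_object": [
        ("http://schema.org/name", "name"),
        ("http://example.org/country", "country"),
    ],
}


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def apply_mapping(csv_text, mapping):
    """Yield one N-Triples line per (row, predicate) pair."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = mapping["subject_template"].format(**row)
        for predicate, column in mapping["predicate_object"]:
            yield f'<{subject}> <{predicate}> "{escape_literal(row[column])}" .'


triples = list(apply_mapping(CSV_TEXT, MAPPING))
for line in triples:
    print(line)
```

Keeping the mapping as data rather than code is what allows the same rules to be stored, shared and re-executed over new inputs.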
LinkR
LinkR provides capabilities for semantic annotation of structured and semi-structured data using reference knowledge graphs and category schemes. Annotations consist of links from elements of the input data to elements of well-established knowledge bases and ontologies (e.g., WikiData, DBpedia, and Geonames), or of user-defined knowledge graphs made available through ResourcR. They include schema-level annotations with ontology terms and instance-level annotations with identifiers. LinkR supports annotation through ML algorithms that recommend annotations and a human-in-the-loop approach that enables fine-tuning of the recommendation algorithms and revision of the results, ensuring high-quality annotations while minimizing the users' effort even on very large data volumes. Annotations will be converted into data transformations, to be used as part of enrichment pipelines.
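The recommend-then-revise loop described above can be sketched as follows. This is not LinkR's algorithm: plain string similarity from Python's `difflib` stands in for the ML-based recommender, and the `REFERENCE` entries, the `ex:` identifiers, the threshold value, and the `overrides` parameter (representing human corrections) are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical reference entries (identifier -> label); identifiers are made up.
REFERENCE = {
    "ex:Q1": "Oslo",
    "ex:Q2": "Milan",
    "ex:Q3": "Ljubljana",
}


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def recommend(cell, reference, threshold=0.8):
    """Return (identifier or None, score) for the best-matching reference label."""
    best_id, best_score = max(
        ((ident, similarity(cell, label)) for ident, label in reference.items()),
        key=lambda pair: pair[1],
    )
    return (best_id if best_score >= threshold else None), best_score


def annotate(cells, reference, overrides=None):
    """Annotate cells; human-supplied overrides take precedence over recommendations."""
    overrides = overrides or {}
    annotations = {}
    for cell in cells:
        if cell in overrides:
            annotations[cell] = overrides[cell]
        else:
            annotations[cell], _ = recommend(cell, reference)
    return annotations


annotations = annotate(
    ["Milano", "oslo", "Paris"], REFERENCE, overrides={"Paris": "ex:Q9"}
)
print(annotations)
```

Cells with no candidate above the threshold are left unlinked (`None`), which is where a reviewer's attention would be directed first.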
StructR
StructR is the counterpart of LinkR for unstructured data. It generates structured data from the unstructured input text through semantic annotation, linking and extension. The text is processed by linguistic and semantic tools and concept mentions are identified and disambiguated from context. Furthermore, the text is projected into an embedding space by using representation learning. StructR supports a range of different pre-computed embeddings to represent the text and expand the dataset. Extension with custom annotation services is supported through a labeling interface for creating and editing text annotations which can then be used to build new annotation models in a human-in-the-loop fashion.
ClassifiR
ClassifiR supports data classification as a service and complements StructR. Whereas StructR identifies properties of parts of the text, ClassifiR labels the documents as a whole. The labels can be part of standard taxonomies, industry classifications, as well as custom sets of labels for which a classifier is built. Custom classification is supported by an interactive graphical interface which allows users to explore a document corpus and create ontologies through clustering, labelling and querying. ClassifiR automates the classification process and exposes it through a common endpoint that is independent of the classification used.
ResourcR
ResourcR provides infrastructure components to support the creation of linking services for a given dataset from a data provider, as well as access mechanisms such as search and query. ResourcR enables performant linking and search functionalities with limited effort and exposes them as search and linking APIs. The combination of ResourcR and LinkR makes it possible to turn semantic data produced with the toolbox into resources immediately available for reuse.
ScalR
ScalR provides infrastructure components for executing cleaning, transformation and linking at large scale. ScalR provides horizontal scalability of data enrichment pipelines using software containers, and supports flexible management of the procedures involved in executing data enrichment pipelines on heterogeneous computing infrastructures. ScalR provides integrated support for specific data enrichment operations in the form of a pipeline through reusable standard templates for setting up such pipelines. ScalR promotes the reuse and modification of existing data enrichment pipelines by exposing them as an integrated deployable unit, as opposed to ad hoc, non-reusable pieces of code.
ReusR
ReusR provides infrastructure components for search and recommendation of assets (e.g., datasets and transformations) related to setting up and running data enrichment pipelines. It provides user login for asset management, public/private access to assets, and support for editing, sharing and versioning them. ReusR enables users of the enRichMyData Toolbox to edit pipelines and promotes their reuse across use cases.
StreamR
StreamR provides infrastructure components for streaming support in data enrichment pipelines. It pipes data streams from/to appropriate endpoints and ensures high throughput, providing a configurable set of tools for setting up custom streams for new applications.
GreenR
GreenR provides infrastructure components to support monitoring of data enrichment pipelines in terms of their environmental impact. It monitors the carbon footprint of the various components in the pipeline and presents the results to the user through a dashboard, so that the environmental impact of the heavy computations within the pipelines can be logged and modulated.
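A common way to estimate the carbon footprint of a computation is to multiply the energy it consumed by the carbon intensity of the electricity used. The sketch below applies that formula per pipeline step. It is not GreenR's implementation: the `StepUsage` type, the power and runtime figures, and the grid intensity value are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass
class StepUsage:
    """Measured usage of one pipeline step (hypothetical fields)."""
    name: str
    avg_power_watts: float
    runtime_seconds: float


def emissions_grams(step, intensity_g_per_kwh):
    """Estimate CO2-equivalent emissions: energy (kWh) times grid intensity (g/kWh)."""
    # 1 kWh = 1000 W * 3600 s = 3,600,000 watt-seconds.
    energy_kwh = step.avg_power_watts * step.runtime_seconds / 3_600_000
    return energy_kwh * intensity_g_per_kwh


# Hypothetical measurements and an assumed grid intensity.
steps = [
    StepUsage("cleaning", avg_power_watts=100.0, runtime_seconds=3600.0),
    StepUsage("linking", avg_power_watts=250.0, runtime_seconds=7200.0),
]
INTENSITY_G_PER_KWH = 400.0

report = {s.name: emissions_grams(s, INTENSITY_G_PER_KWH) for s in steps}
total = sum(report.values())

for name, grams in report.items():
    print(f"{name}: {grams:.1f} g CO2e")
print(f"total: {total:.1f} g CO2e")
```

A per-step breakdown like this is the kind of data a dashboard could display to show which pipeline components dominate the footprint.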