Results – MiKroLog

We detail the results of the three MiKroloG tasks:

Task 1: Entity Matching
Task 2: Querying and Ranking
Task 3: Querying the microdata-KG as end-users

Task 1: Entity Matching

We aim to perform entity matching from web entities in schema.org datasets to knowledge graphs.

Two important results:

Schema.org: How is it used?
FedShop: A Benchmark for Testing the Scalability of SPARQL Federation Engines

Schema.org: How is it used? (Poster at ISWC 2023)
- Approach: Our approach relies on characteristic sets, which group semantically similar entities according to the set of properties they use.
- Contributions: We computed characteristic sets (CSets) for the JSON-LD dataset (the most used format) of WebDataCommons (October 2021). The CSets are available at (https://doi.org/10.5281/zenodo.8167689) and serve as a basis for answering different questions. All results are available at (https://schema-obs-demo.onrender.com). To analyze the schema.org dataset, which is composed of 6.7B web entities, we used an 8-node HPC cluster (8 CPU threads, 32 GB of RAM, and 20 GB of local storage per node). Computing the 4,638,824 CSets took around 30 hours.
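
As a minimal illustration of the characteristic-set idea, the following Python sketch groups entities by the set of properties they use and counts the entities in each group. The toy triples and the `ex:` identifiers are invented for the example, and the code is not the pipeline that was run on the HPC cluster.

```python
from collections import Counter, defaultdict


def characteristic_sets(triples):
    """Return a Counter mapping each property set to its number of entities."""
    properties = defaultdict(set)
    for subject, predicate, _obj in triples:
        properties[subject].add(predicate)
    # Entities sharing exactly the same property set fall into the same CSet.
    return Counter(frozenset(props) for props in properties.values())


# Toy data (hypothetical): two product-like entities and one place-like entity.
triples = [
    ("ex:p1", "schema:name", "Laptop"),
    ("ex:p1", "schema:price", "999"),
    ("ex:p2", "schema:name", "Phone"),
    ("ex:p2", "schema:price", "499"),
    ("ex:r1", "schema:name", "Chez Anna"),
    ("ex:r1", "schema:address", "Nantes"),
]

csets = characteristic_sets(triples)
for cset, count in csets.most_common():
    print(sorted(cset), count)
```

Here `ex:p1` and `ex:p2` end up in the same CSet because they use the same two properties, while `ex:r1` forms a CSet of its own.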

FedShop: A Benchmark for Testing the Scalability of SPARQL Federation Engines (Paper at ISWC 2023)
- Approach: We propose the first synthetic, scalable benchmark that contains aligned web entities: both the number of knowledge graphs and the number of “sameAs” relations can be controlled.
- Contribution: FedShop is a novel benchmark designed for scalability experiments. It captures an e-commerce scenario with a scalable federation of online shops and rating sites, together with query workloads that simulate users who explore and search for products and offers across the federation. More specifically, the benchmark consists of the following:
  - ten pre-generated federations ranging from 20 to 200 federation members,
  - a schema-based dataset generator to generate further federations, for which the scale factor is the number of federation members and the distribution law of every relationship of the data schema can be configured,
  - 12 query templates capturing different, use-case-specific types of queries,
  - a collection of ten such queries per template (i.e., 120 queries overall).
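
To make the relationship between query templates and queries concrete, here is a small Python sketch of template instantiation. The template text, the `product_type` placeholder, and the `example.org` IRIs are hypothetical and are not taken from FedShop; the sketch only shows how one template can yield ten distinct queries.

```python
import random
from string import Template

# Hypothetical template: placeholder name and vocabulary are illustrative only.
TEMPLATE = Template(
    "SELECT ?product ?label WHERE { "
    "?product a <$product_type> ; "
    "<http://www.w3.org/2000/01/rdf-schema#label> ?label . "
    "} LIMIT 10"
)


def instantiate(template, candidates, n, seed=0):
    """Fill the placeholder with n distinct, randomly chosen values."""
    rng = random.Random(seed)
    values = rng.sample(candidates, n)
    return [template.substitute(product_type=value) for value in values]


# Made-up pool of values from which the placeholder is filled.
product_types = [f"http://example.org/ProductType{i}" for i in range(50)]
queries = instantiate(TEMPLATE, product_types, n=10)

# 12 templates with 10 instantiations each give the 120 queries mentioned above.
print(len(queries), 12 * len(queries))
```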

Task 2: Querying and Ranking

Processing SPARQL TOP-k Queries Online with Web Preemption (QuWeDa@ISWC2022)

Processing top-k queries on public online SPARQL endpoints often runs into
fair-use policy quotas, which leads to incomplete results. Indeed, existing
endpoints mainly follow the traditional materialize-and-sort strategy.
Restricted SPARQL servers ensure the termination of top-k queries
without enforcing quotas, but they also follow the materialize-and-sort approach,
resulting in high data transfer and poor performance.

We propose to extend the Web preemption model with a preemptable
partial top-k operator. This operator drastically reduces data transfer and
significantly improves query execution time.
Experimental results show a reduction in data transfer by a factor of 100 and
a reduction of up to 39% in Wikidata query execution time.
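
The intuition behind a partial top-k operator can be sketched in Python as follows. This is a simplified illustration, not the implementation evaluated in the paper: the `PartialTopK` class and `run_with_quanta` function are hypothetical names, and the quantum is counted in processed items rather than in execution time. The point is that the state saved at each suspension holds at most k values, so there is no need to materialize and sort all solutions.

```python
import heapq
from itertools import islice


class PartialTopK:
    """Keeps the k largest values seen so far in a bounded min-heap."""

    def __init__(self, k, state=None):
        self.k = k
        self.heap = list(state or [])
        heapq.heapify(self.heap)

    def push(self, value):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, value)
        elif value > self.heap[0]:
            # Replace the smallest kept value; the heap never exceeds k items.
            heapq.heapreplace(self.heap, value)

    def save(self):
        """State to keep when the query is suspended (at most k values)."""
        return list(self.heap)

    def result(self):
        return sorted(self.heap, reverse=True)


def run_with_quanta(values, k, quantum):
    """Process the input in slices, suspending and resuming between slices."""
    iterator = iter(values)
    state = []
    while True:
        chunk = list(islice(iterator, quantum))
        if not chunk:
            break
        operator = PartialTopK(k, state)  # resume from the saved state
        for value in chunk:
            operator.push(value)
        state = operator.save()  # suspend: only k values are retained
    return PartialTopK(k, state).result()


scores = [5, 1, 9, 3, 7, 8, 2, 6, 4, 0]
top3 = run_with_quanta(scores, k=3, quantum=4)
print(top3)  # [9, 8, 7]
```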

RAW-JENA: Approximate Query Processing for SPARQL Endpoints (Demonstration at ISWC 2023)
Sampling-based Approximate Query Processing (S-AQP) has many important use cases for RDF, including computing large-scale statistics, embeddings, join orderings, approximate aggregations, summaries, and exploratory queries. However, current SPARQL endpoints have no support for S-AQP, and many queries just time out on public SPARQL endpoints. We propose RAW-JENA: an extension of Apache Jena to support S-AQP for conjunctive SPARQL queries relying on random walks. RAW-JENA delivers partial random results and cardinality estimates in a pay-as-you-go fashion.
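
To illustrate how random walks can yield cardinality estimates, the Python sketch below estimates the number of results of a chain-shaped conjunctive query over an in-memory list of triples. It is a generic random-walk estimator written for this page, not RAW-JENA's code inside Apache Jena; the function name, the toy data, and the restriction to chain queries are assumptions made for the example. Each walk picks a random triple for the first pattern, then follows random matching triples for the next patterns, and contributes the product of the numbers of choices it had at each step (or 0 if it reaches a dead end).

```python
import random
from collections import defaultdict


def estimate_chain_cardinality(triples, predicates, walks=10_000, seed=0):
    """Estimate |?x p1 ?y . ?y p2 ?z . ...| by averaging random-walk weights."""
    rng = random.Random(seed)
    # index[predicate][subject] -> list of objects
    index = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        index[p][s].append(o)
    first = [(s, o) for s, objs in index[predicates[0]].items() for o in objs]
    if not first:
        return 0.0
    total = 0.0
    for _ in range(walks):
        _subject, node = rng.choice(first)
        weight = float(len(first))
        for predicate in predicates[1:]:
            candidates = index[predicate].get(node, [])
            if not candidates:
                weight = 0.0  # dead end: this walk contributes nothing
                break
            weight *= len(candidates)
            node = rng.choice(candidates)
        total += weight
    return total / walks


# Toy data (hypothetical); the exact join ?x knows ?y . ?y worksAt ?z has 5 results.
triples = [
    ("a", "knows", "b"), ("a", "knows", "c"), ("d", "knows", "b"),
    ("e", "knows", "f"),
    ("b", "worksAt", "x"), ("b", "worksAt", "y"), ("c", "worksAt", "z"),
]
estimate = estimate_chain_cardinality(triples, ["knows", "worksAt"])
print(round(estimate, 2))
```

More walks give a tighter estimate, which matches the pay-as-you-go behaviour described above: the estimate is usable early and improves as sampling continues.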

RAW-JENA in action (Demonstration at ISWC 2023)

Task 3: Querying the microdata-KG as end-users

A regular user uses Sparklis to formulate a complex query (Poster at GDR TAL 2022)
- Context and motivation: Controlled language is inherently natural, precise, and expressive. Moreover, keyword search is available. However, schemas are used heterogeneously in microdata and the data is noisy, which makes navigation in such data tedious and difficult. We need more spontaneous user interaction, for example by letting the machine navigate on the user's behalf.
- Proposal: Natural language interaction system
  - Objective: Spontaneous natural language interactions and autopilot mode for Sparklis

Natural language for querying Knowledge Graph (ECAI 2023)

Results: Language Models as CNL for Knowledge Graph Question Answering