MiKroloG: The Microdata Knowledge Graph

Context

Searching the web has changed our daily lives; documents on the web containing a list of keywords can be found in a snap. However, keyword searches often return many irrelevant documents, pushing users to refine their keyword list through trial and error. Then, users wanted to find Things, not strings. “Paris” as a string refers to the capital of France, but also to a city in Canada, a city in Arkansas, a movie, a band, a ship, a person, or a manuscript. Across languages, “paris” also means “bets” in French. Searching for Things means that a keyword query returns a Thing or a collection of Things instead of documents. Searching for Things requires a knowledge graph (KG) storing entities that represent Things, i.e., instances of concepts such as a person, a place, or an event. For example, there is one entity for Paris as a city and another for Paris as a manuscript of 1844. Thanks to these knowledge graphs, users who ask for the movies of James Cameron receive a list of movies in which James Cameron and his movies are Things, i.e., entities defined in the KG.

However, searching the web and searching for Things are entirely different. Searching the web offers diversity at the price of noise. Searching for Things delivers exact answers, but we lose diversity. Is there a way to have diversity without noise?

Objectives

Slides of the CominLabs Days, 25-27 September 2023

In MiKroloG, we aim to search the web with Things. Since a KG returns a collection of entities, these entities can be used to retrieve the web pages that refer to them. In this way, it is possible to have both diversity and accuracy. For instance, we may search for the websites selling “James Cameron movies” ordered by price and rating, or for experimental data related to the COVID pandemic published by UK public universities ordered by date. These queries first explore existing knowledge graphs to retrieve collections of Things such as “UK public universities” and “James Cameron movies”. Then, we explore which “commercial websites” or “university web pages” refer to these Things, i.e., we search the web with Things.
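The two steps described above can be illustrated with a minimal sketch. It is not the MiKroloG implementation: the in-memory triples, the `ex:` identifiers, the example URLs, and the function names are all hypothetical, and a real system would query a KG endpoint and a web-scale index instead of Python lists.

```python
# Toy data: every identifier, URL, and price below is hypothetical.
KG = [
    ("ex:Avatar", "ex:director", "ex:JamesCameron"),
    ("ex:Titanic", "ex:director", "ex:JamesCameron"),
    ("ex:Alien", "ex:director", "ex:RidleyScott"),
]

# Each page records the entity it refers to, as could be derived from microdata.
PAGES = [
    {"url": "https://shop-a.example/avatar", "about": "ex:Avatar", "price": 12.0},
    {"url": "https://shop-b.example/titanic", "about": "ex:Titanic", "price": 8.5},
    {"url": "https://shop-a.example/alien", "about": "ex:Alien", "price": 9.0},
]


def things(kg, predicate, obj):
    """Step 1: query the KG for the subjects of triples (?, predicate, obj)."""
    return {s for (s, p, o) in kg if p == predicate and o == obj}


def pages_about(pages, entities):
    """Step 2: keep pages that refer to one of the entities, cheapest first."""
    hits = [page for page in pages if page["about"] in entities]
    return sorted(hits, key=lambda page: page["price"])


def search_web_with_things(kg, pages, predicate, obj):
    """Chain both steps: Things first, then the pages that mention them."""
    return pages_about(pages, things(kg, predicate, obj))


results = search_web_with_things(KG, PAGES, "ex:director", "ex:JamesCameron")
for page in results:
    print(page["url"], page["price"])
```

The KG step gives the accuracy (only James Cameron movies are selected), while the page step brings back the diversity of the web (any site that refers to those entities).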

Searching the web with Things requires a close connection between the web of documents and knowledge graphs. Currently, this connection is partially powered by microdata embedded in web pages. Half of all web pages integrate microdata describing people, places, organizations, events, products, and drugs following the Schema.org ontology. This represents billions of facts spread over millions of constantly evolving websites. Google Dataset Search relies on microdata to search for datasets on the web, and Google Shopping relies on microdata to feed its marketplace and search for products.
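To make the notion of embedded microdata concrete, here is a small sketch that reads the `itemscope`, `itemtype`, and `itemprop` attributes from an HTML fragment using Python's standard `html.parser` module. The HTML fragment and the `FlatMicrodataParser` class are illustrative assumptions; the parser only handles a single, non-nested item and is far from a complete microdata extractor.

```python
from html.parser import HTMLParser

# Hypothetical page fragment annotated with Schema.org microdata.
HTML = """
<div itemscope itemtype="https://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <span itemprop="director">James Cameron</span>
  <meta itemprop="datePublished" content="2009-12-18">
</div>
"""


class FlatMicrodataParser(HTMLParser):
    """Collects the properties of a single, non-nested microdata item."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # The itemtype URL names the Schema.org class of the item.
            self.item_type = attrs.get("itemtype")
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if "content" in attrs:
            # Value carried by an attribute, e.g. <meta itemprop content>.
            self.properties[prop] = attrs["content"]
        else:
            # Value is the text that follows the opening tag.
            self._current_prop = prop

    def handle_data(self, data):
        if self._current_prop and data.strip():
            self.properties[self._current_prop] = data.strip()
            self._current_prop = None


parser = FlatMicrodataParser()
parser.feed(HTML)
print(parser.item_type, parser.properties)
```

Each extracted item is a small set of facts about a Thing (here, a movie), which is what makes microdata a candidate bridge between web pages and KG entities.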

To search the web with Things, we face three main scientific challenges:

  • Users are used to searching using keywords. Transforming a keyword query into a knowledge graph query is difficult, especially for complex queries.
  • As with traditional web searches, users expect to get ranked results in a snap. It is very challenging to provide top-ranked results for complex queries on large knowledge graphs.
  • Microdata provides some links between web pages and knowledge graph entities, but these links must be computed by solving the problem of matching microdata to knowledge graph entities. Performing entity matching at a large scale between microdata in web pages and knowledge graph entities is challenging.
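The third challenge, matching microdata to KG entities, can be sketched in its simplest form as follows. This is only a toy baseline under stated assumptions: the `ex:` identifiers, the entity records, the token-level Jaccard similarity, and the 0.5 threshold are all hypothetical choices, and matching at web scale would need blocking, richer features, and a far larger KG.

```python
def tokens(label):
    """Lower-case a label and split it into a set of alphanumeric tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in label)
    return set(cleaned.split())


def jaccard(a, b):
    """Token-level Jaccard similarity between two labels, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_entity(item, kg_entities, threshold=0.5):
    """Return the id of the best same-type KG candidate, or None."""
    best_id, best_score = None, 0.0
    for entity_id, entity in kg_entities.items():
        if entity["type"] != item["type"]:
            continue  # the type separates Paris the city from Paris the movie
        score = jaccard(item["name"], entity["label"])
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_score >= threshold else None


# Hypothetical KG entities and microdata items.
KG_ENTITIES = {
    "ex:Paris_city": {"type": "City", "label": "Paris"},
    "ex:Paris_film": {"type": "Movie", "label": "Paris"},
    "ex:Titanic_1997": {"type": "Movie", "label": "Titanic"},
}

print(match_entity({"type": "City", "name": "Paris"}, KG_ENTITIES))
print(match_entity({"type": "Movie", "name": "PARIS"}, KG_ENTITIES))
print(match_entity({"type": "Movie", "name": "Unknown Title"}, KG_ENTITIES))
```

Even this toy version shows why the string alone is not enough: the two “Paris” items have identical names and are only told apart by their type.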

Proposal
