Overview: optimization

Query rewriting

Standard optimization techniques are based on query rewriting, which consists in associating with a given query expression a logically equivalent expression that is less costly to evaluate. In the AKIRA approach, the query expression given by the user is used to (1) retrieve documents from the Web, (2) build the structured cache (database), and (3) query the database. Optimization can be done at each of these steps. The last step would only consist in using standard optimization techniques for OQL queries.

Combining with IR

Several directions should be investigated to optimize query evaluation in the AKIRA approach at the stage where information is retrieved from the Web.

A "populating class" is extracted from the user's query. With the restricted pool of schema components available today in the AKIRA system, the only "populating class" is the class Conference. Each "populating class" is associated with a parameterized IR query.

An IR query accesses relevant Web services (search engine, database...) to retrieve information. Today, the only IR query attached to the class Conference is a query that retrieves Calls for Papers, since the information about conferences available in the conceptual representation (dates, location, PC members, URL, etc.) can be extracted from Calls for Papers. Tomorrow, the system could be extended to other information about conferences, such as registration_fee, program, etc., which can be extracted from Calls for Participation. To do so, the IR query associated with the class Conference should also retrieve Calls for Participation.

The IR query sent to the Web will depend on the user's query in the following way. For each attribute of the target structure, the Dispatcher will have to find the right source of documents, that is, one from which the IE tool associated with the attribute is able to extract information. For instance, the dates of the conference can be extracted from both the Call for Papers and the Call for Participation (with the same IE tool), whereas the deadline for submissions can only be extracted from the Call for Papers, and the registration fee only from the Call for Participation. The evaluation of the query is then somewhat more complicated, since several sources of documents may have to be retrieved and then processed by IE tools to extract the information necessary to populate a single structured cache.
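
The attribute-to-source matching described above can be sketched as follows. This is only an illustration, not the AKIRA Dispatcher itself: the table EXTRACTABLE_FROM, the source labels and the function sources_to_retrieve are hypothetical names, and only the three attribute/source relationships stated in the text are encoded. The selection rule (forced sources first, then an arbitrary deterministic choice) is a simple heuristic assumed for the example.

```python
# Hypothetical lookup table: for each attribute, the document sources from
# which its IE tool can extract a value. The three entries mirror the example
# in the text; the identifiers themselves are illustrative assumptions.
EXTRACTABLE_FROM = {
    "dates": {"call_for_papers", "call_for_participation"},
    "submission_deadline": {"call_for_papers"},
    "registration_fee": {"call_for_participation"},
}


def sources_to_retrieve(attributes):
    """Return a set of document sources that covers every requested attribute.

    Heuristic (an assumption, not the AKIRA algorithm): sources that are the
    only option for some attribute are selected first; any attribute still
    uncovered then picks one of its candidate sources deterministically.
    """
    unknown = [a for a in attributes if a not in EXTRACTABLE_FROM]
    if unknown:
        raise KeyError(f"no known source for: {unknown}")

    chosen = set()
    # Pass 1: an attribute with a single candidate source forces that source.
    for attr in attributes:
        candidates = EXTRACTABLE_FROM[attr]
        if len(candidates) == 1:
            chosen |= candidates
    # Pass 2: cover the remaining attributes, reusing chosen sources if possible.
    for attr in attributes:
        candidates = EXTRACTABLE_FROM[attr]
        if not candidates & chosen:
            chosen.add(min(candidates))  # alphabetical tie-break
    return chosen
```

With this table, a query that asks only for the dates and the registration fee needs a single source (Calls for Participation), while a query that also asks for the submission deadline needs both kinds of documents.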

The IR component raises several optimization issues. First, the right service has to be chosen to retrieve information from the Web. Should it be decided once and for all? Or should the system allow some flexibility, access several competing services, and thus deal with redundant information in different formats (warehouse)? Then, for each of these services, the right query has to be asked. What is the best combination of keywords to obtain good recall and precision with this search engine? What are the relevant parameters that should be added in order to better filter the retrieval step according to the user's need?
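
One way to make the keyword question concrete is to score each candidate keyword combination against a small, hand-labeled sample of relevant documents. The sketch below assumes such a sample exists and uses the F1 score (the harmonic mean of precision and recall) to rank candidates. That choice of score, as well as the function names and the document identifiers, are assumptions made for the example and are not taken from AKIRA.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a set of relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def best_keywords(results_by_query, relevant):
    """Return the keyword combination whose results have the highest F1."""
    return max(
        results_by_query,
        key=lambda query: f1(*precision_recall(results_by_query[query], relevant)),
    )


# Hypothetical sample: document ids returned by one search engine for two
# keyword combinations, and the ids judged relevant by hand.
relevant_docs = {"a", "b", "c", "d"}
results = {
    "call for papers": ["a", "b", "x", "y"],
    "conference cfp": ["a", "b", "c"],
}
chosen_query = best_keywords(results, relevant_docs)
```

The same scoring could be repeated per service to compare search engines, at the cost of labeling a sample for each of them.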

Defining the Object View

The second use of the user's query consists in defining the object view itself. The interface allows users to express themselves with fuzziness and path expressions. When processing the query, several target structures that are equivalent with respect to the user's query may be used to answer it. The system should choose the least expensive one, taking into account recall/precision (choosing the classes for which IR queries are good), the data already available in the current cache, the costs of IE tools, etc.

Combining with IE

The fragment representation may be used for further optimization. When the goal is to extract information about particular instances of concepts (such as SIGMOD for the concept Conference), why process the whole underlying understructured cache? Only files that mention the given instances should be processed. Each instance of a concept is associated with the list of its fragments. It is then easy to process only the files whose names appear in the first component of these fragments. The second component (span) of fragments may also be used by extractors that are based on a notion of locality. In this latter case, why should an extractor process the whole file when it only needs to process a zone around a given string of characters, and when this particular string is directly accessible in the document through the span? The use of fragments may improve the extraction capabilities of the system and thus optimize the evaluation of queries.
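
The two uses of fragments described above can be sketched as follows. The text only says that a fragment has a file name as first component and a span as second component. Representing the span as a pair of (start, end) character offsets, and the names Fragment, files_to_process and extraction_zone, are assumptions made for this illustration.

```python
from collections import namedtuple

# Assumed representation: a fragment is (file name, span), where the span is
# a (start, end) pair of character offsets into the file's text.
Fragment = namedtuple("Fragment", ["file_name", "span"])


def files_to_process(fragments):
    """Names of the files that mention the instance, without duplicates."""
    return sorted({fragment.file_name for fragment in fragments})


def extraction_zone(text, span, margin=40):
    """Slice of text around a span, for extractors based on locality.

    margin is a hypothetical parameter: the number of characters kept on
    each side of the span, clipped to the bounds of the text.
    """
    start, end = span
    lower = max(0, start - margin)
    upper = min(len(text), end + margin)
    return text[lower:upper]


# Hypothetical fragments for the instance SIGMOD of the concept Conference.
sigmod_fragments = [
    Fragment("sigmod.html", (50, 56)),
    Fragment("sigmod.html", (300, 306)),
]
document = "x" * 50 + "SIGMOD" + "y" * 50
zone = extraction_zone(document, sigmod_fragments[0].span, margin=5)
```

Here only sigmod.html would be handed to the extractor, and a locality-based extractor would read the few characters around the mention instead of the whole file.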

Copyright AKIRA Project, University of Pennsylvania
Department of Computer and Information Science, Institute for Research in Cognitive Science