Overview: the user


What AKIRA can understand

AKIRA is not limited by an imposed organization of the data, as a source-driven approach would be. Its limitations lie instead in its extraction capabilities, which are expressed as concepts, meta-concepts and attributes. The current AKIRA system offers the user a limited range of concepts with which to express a query. Today, the concepts known to the system are Conference, Date and Location. Each concept corresponds to an information extraction (IE) tool.

concept class Conference
Most IE tools are domain and format dependent. AKIRA's conference extractor is designed to extract conference names from a source consisting of Calls for Papers. The first step is recognition: the strings of characters corresponding to conference names are recognized in a document and tagged there. The second step is co-reference: different strings recognized as conference names may refer to the same conference. For instance, the strings "Int. Conf. on Very Large DataBases" and "VLDB" refer to the same conference. The third step is identification: to each set of co-referent strings corresponds one instance of class Conference. A canonical representation is chosen for conference names: an instance of concept class Conference is described by the acronym of the conference followed by the corresponding year, for instance "SIGMOD'98" or "SIGMOD'99". Moreover, each document (call for papers) is associated with one conference (or split into several documents when it is the call for papers of several conferences). This mapping from documents to instances of class Conference is used by the other extractors, under the assumption that all information extracted from a document refers to the instance of Conference associated with that document.
The canonical representation is a key for the class. In the conceptual representation of extraction capabilities, the concept class Conference is defined with at least two attributes: fragment, which associates with each instance the set of all its fragments (the pairs [file, span] of all its occurrences in the documents), and name, its key.
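The co-reference and identification steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alias table, the function name identify and the input format (string, file, span) are hypothetical, and the text does not say how AKIRA actually decides that two strings are co-referent.

```python
from dataclasses import dataclass, field

# Hypothetical alias table: maps a recognized string to a conference acronym.
# The real co-reference method is not described in the text.
ALIASES = {
    "Int. Conf. on Very Large DataBases": "VLDB",
    "VLDB": "VLDB",
}

@dataclass
class Conference:
    name: str  # canonical key, e.g. "VLDB'98"
    fragment: set = field(default_factory=set)  # (file, span) pairs

def identify(mentions, year):
    """Group co-referent mentions into Conference instances.

    mentions: iterable of (string, file, span) produced by a recognizer.
    year: the year of the conference (assumed to be known already).
    """
    instances = {}
    for text, file, span in mentions:
        acronym = ALIASES.get(text)
        if acronym is None:
            continue  # string not known as a conference name
        name = f"{acronym}'{year % 100:02d}"  # acronym followed by the year
        conf = instances.setdefault(name, Conference(name))
        conf.fragment.add((file, span))
    return instances

mentions = [
    ("Int. Conf. on Very Large DataBases", "cfp1.txt", (0, 34)),
    ("VLDB", "cfp1.txt", (120, 124)),
]
result = identify(mentions, 1998)
```

Both mentions end up as fragments of a single instance whose key is the canonical name, which mirrors the role of the attributes name and fragment.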

 concept class Date
The date extractor follows the same successive steps (recognition, co-reference and identification), but it is not restricted to a source of Calls for Papers: it extracts dates from any English textual document. The canonical representation of a date consists of three attributes, month, day and year, which together constitute the key of the class.
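As a minimal sketch of what a canonical date key means, the Python fragment below maps two surface forms of the same date to one (month, day, year) triple. The list of accepted formats and the name canonical_date are assumptions made for the example; AKIRA's own recognizer is not described here.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Date:
    # The three attributes together form the key of the class.
    month: int
    day: int
    year: int

# Hypothetical list of surface formats; a real recognizer handles many more.
FORMATS = ("%B %d, %Y", "%d %B %Y", "%m/%d/%Y")

def canonical_date(text):
    """Return the canonical Date for a date string, or None if unrecognized."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next format
        return Date(parsed.month, parsed.day, parsed.year)
    return None
```

Because the dataclass is frozen and compares by value, two co-referent strings yield equal instances that can be used as dictionary keys.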

 concept class Location
Like the date extractor, the location extractor is not restricted to a given source. It is based on a recognizer (many thanks to Breck Baldwin, who let us use his recognizer) that uses a list of all cities, states and countries in the world. The chosen canonical representation is composed of city, state and country, which together constitute the key of the class.
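A list-based recognizer of this kind can be pictured as a gazetteer lookup. The sketch below is a simplified assumption, not the recognizer mentioned above: the two-entry GAZETTEER, the single-word matching and the name recognize_locations are all hypothetical.

```python
import re
from typing import NamedTuple, Optional

class Location(NamedTuple):
    # city, state and country together form the key of the class.
    city: str
    state: Optional[str]
    country: str

# Tiny hypothetical gazetteer; the real list covers the whole world.
GAZETTEER = {
    "paris": Location("Paris", None, "France"),
    "philadelphia": Location("Philadelphia", "Pennsylvania", "USA"),
}

def recognize_locations(text):
    """Return (Location, span) pairs for every gazetteer entry found in text."""
    hits = []
    for match in re.finditer(r"[A-Za-z]+", text):
        location = GAZETTEER.get(match.group().lower())
        if location is not None:
            hits.append((location, match.span()))
    return hits
```

A lookup this naive cannot handle multi-word names or ambiguous city names, which is where a dedicated recognizer earns its keep.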
 

In our conceptual representation, a concept class may be specialized by either valued attributes or abstract attributes (meta-concepts). Today, AKIRA provides extractors for the following information about conferences:

How to express queries

The query language available to the user is OQL-like. A query consists of a select ... from ... where expression. For instance, the query "What are the conferences located in Paris?" is expressed by:
      select x.name 
      from   x in Conference
      where  x.location.city="Paris"
However, within the limits of the system's knowledge, some fuzziness is allowed. The user can express a query using his or her own vocabulary. For instance, the following expression is equivalent to the previous one (and thus produces the same output):
      select x.name 
      from   x in Conf 
      where  x.location="Paris"
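To make the meaning of the first query concrete, it can be read as a filter over the instances of class Conference. The Python sketch below is only an analogy: the classes, the function conferences_in and the sample instances are hypothetical and do not reflect how AKIRA evaluates queries.

```python
from dataclasses import dataclass

@dataclass
class Location:
    city: str
    country: str

@dataclass
class Conference:
    name: str
    location: Location

def conferences_in(city, conferences):
    # select x.name from x in Conference where x.location.city = city
    return [x.name for x in conferences if x.location.city == city]

# Made-up instances, for illustration only.
sample = [
    Conference("EXAMPLE-A'98", Location("Paris", "France")),
    Conference("EXAMPLE-B'98", Location("Philadelphia", "USA")),
]
names = conferences_in("Paris", sample)
```

The from clause corresponds to the collection being iterated, the where clause to the condition, and the select clause to the projected attribute.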
AKIRA uses a primitive thesaurus and some regular expressions to map class attributes from the user's query to attributes understood at the level of components. As an illustration, the attribute submission_deadline is recognized by the system from any string matching the regular expression:
      /^(conf(.*?))*(date(.*?))*sub(.*?)(date|deadline)*$/i
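The pattern above can be tried out directly. The sketch below transcribes it to Python's re module, with the trailing /i flag rendered as re.IGNORECASE; the function name resolve_attribute is hypothetical, and returning None for an unmatched string is an assumption about how a miss would be reported.

```python
import re

# The pattern from the text; the Perl-style /i flag becomes re.IGNORECASE.
SUBMISSION_DEADLINE = re.compile(
    r"^(conf(.*?))*(date(.*?))*sub(.*?)(date|deadline)*$",
    re.IGNORECASE,
)

def resolve_attribute(user_attribute):
    """Map a user-supplied attribute name to a system attribute, or None."""
    if SUBMISSION_DEADLINE.search(user_attribute):
        return "submission_deadline"
    return None  # the system does not understand this attribute

for candidate in ("submission_deadline", "conference_submission_date",
                  "Sub_Date", "deadline", "location"):
    print(candidate, "->", resolve_attribute(candidate))
```

Note that the pattern is quite permissive: because the group after sub can absorb any remaining characters, every string that starts with "sub" (optionally preceded by "conf..." or "date..." parts) is accepted, while a bare "deadline" is not.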

More advanced strategies are used for more sophisticated meta-concepts. When necessary, tools based on regular patterns as well as on syntactic descriptions, including part-of-speech information such as SuperTags, can be used.

If the user writes an expression that does not match the pattern, the system will not be able to understand the query.



Copyright AKIRA Project, University of Pennsylvania
Department of Computer and Information Science, Institute for Research in Cognitive Science