Overview


The Agentive Knowledge-based and Information Retrieval Architecture (AKIRA) project aims to combine techniques from database technology and information processing to give the user a powerful tool for querying the Web.
 



AKIRA's story

Three Characters

Our user wants to query the Web in a flexible and transparent way, and thus expects the system to provide information retrieval, information extraction and data manipulation. The Web, the most heterogeneous network one can think of, consists of highly structured sources (databases), providing a query language (wrapper), as well as poorly structured sources (HTML pages). The AKIRA system is user-oriented, agent-based and Web-aware.

The plot

Our user wants information about upcoming conferences, for example query Q1: "Conferences with a submission deadline after July 31, 1998?" or query Q2: "Conferences located in the USA?"

A sad story...

With the usual tools, our user has to browse all pages "by hand" (or send the query "Call for papers" to a search engine such as AltaVista to get a list of thousands of potentially relevant Web pages about conferences), read them, and extract the relevant information from their content, still "by hand".

... with a happy ending

In AKIRA, the target structure, which consists of the classes and attributes invoked in the query, is extracted and sent to the View Factory. The target structure is first analyzed to determine a suitable schema to represent information in the smart-cache. The system is user-oriented in the sense that the cache is structured with respect to the given target structure, and optionally completed by several classes in order to call the right Information Extraction (IE) tools to populate the database. The View Factory then sends three queries: (1) a view definition that creates the new classes and attributes, (2) an Information Retrieval (IR) query to the IR component in order to retrieve data from the Web, and (3) a query to the database in order to populate it.
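The three requests issued by the View Factory can be pictured with a small Java sketch. Everything in it is an assumption made for illustration: the class names (`TargetStructure`, `ViewFactorySketch`), the attribute names, and above all the textual syntax of the three requests, none of which is taken from the actual AKIRA implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: names and request syntax are illustrative assumptions,
// not AKIRA's real classes or query languages.
final class TargetStructure {
    final String className;         // class invoked in the user's query
    final List<String> attributes;  // attributes invoked in the user's query

    TargetStructure(String className, List<String> attributes) {
        this.className = className;
        this.attributes = attributes;
    }
}

final class ViewFactorySketch {
    // (1) View definition creating the new class and its attributes in the smart-cache.
    static String viewDefinition(TargetStructure t) {
        return "define class " + t.className + " (" + String.join(", ", t.attributes) + ")";
    }

    // (2) Keyword query handed to the IR component to retrieve documents from the Web.
    static String retrievalQuery(TargetStructure t) {
        return t.className.toLowerCase() + " " + String.join(" ", t.attributes);
    }

    // (3) Request sent to the database so that it populates the new class.
    static String populateQuery(TargetStructure t) {
        return "populate " + t.className + " from retrieved documents";
    }

    // The three requests, in the order in which the text lists them.
    static List<String> plan(TargetStructure t) {
        return Arrays.asList(viewDefinition(t), retrievalQuery(t), populateQuery(t));
    }

    public static void main(String[] args) {
        TargetStructure q1 =
            new TargetStructure("Conference", Arrays.asList("name", "submissionDeadline"));
        for (String request : plan(q1)) {
            System.out.println(request);
        }
    }
}
```

The point of the sketch is only the shape of the process: one target structure in, three differently addressed requests out.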


AKIRA's benefits

A unified framework for IR, IE and data manipulation

AKIRA is an attempt to offer a unified interface for information retrieval (IR), information extraction (IE) and data manipulation.

Most Web services today only provide information retrieval. The user has to write a boolean expression of keywords, which have to be explicitly mentioned in the documents he is interested in. The output of a query to a search engine is a list of URLs corresponding to documents satisfying the criteria. The user has to load and read the returned documents to extract information. Further data manipulation also has to be done by hand.

In AKIRA, the user expresses a query (today an OQL-like query, tomorrow a natural language query). Evaluating the query consists of retrieving relevant documents from the Web, then extracting information and storing it in an object-oriented database (the smart-cache), which replaces the usual unstructured cache. The tasks of information processing (IR and IE) and data manipulation are thus performed automatically and transparently for the user.
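Once extracted information sits in a structured cache, queries such as Q1 and Q2 become ordinary data manipulation. The following Java sketch illustrates that idea with an in-memory list standing in for the smart-cache; the class `ConferenceEntry`, its fields, and the two sample entries are made up for the example and do not reflect AKIRA's actual schema or the ObjectStore API.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical object stored in the smart-cache; the field names are assumptions.
final class ConferenceEntry {
    final String name;
    final LocalDate submissionDeadline;
    final String country;

    ConferenceEntry(String name, LocalDate submissionDeadline, String country) {
        this.name = name;
        this.submissionDeadline = submissionDeadline;
        this.country = country;
    }
}

final class SmartCacheQuerySketch {
    // Q1: names of conferences whose submission deadline is strictly after the cutoff.
    static List<String> deadlineAfter(List<ConferenceEntry> cache, LocalDate cutoff) {
        List<String> names = new ArrayList<>();
        for (ConferenceEntry c : cache) {
            if (c.submissionDeadline.isAfter(cutoff)) {
                names.add(c.name);
            }
        }
        return names;
    }

    // Q2: names of conferences located in the given country.
    static List<String> locatedIn(List<ConferenceEntry> cache, String country) {
        List<String> names = new ArrayList<>();
        for (ConferenceEntry c : cache) {
            if (c.country.equals(country)) {
                names.add(c.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up entries standing in for information extracted from the Web.
        List<ConferenceEntry> cache = Arrays.asList(
            new ConferenceEntry("ConfA", LocalDate.of(1998, 6, 15), "USA"),
            new ConferenceEntry("ConfB", LocalDate.of(1998, 9, 1), "France"));
        System.out.println(deadlineAfter(cache, LocalDate.of(1998, 7, 31)));
        System.out.println(locatedIn(cache, "USA"));
    }
}
```

With a plain list of URLs, as returned by a search engine, neither filter could be expressed without reading each document first.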

A user-oriented approach

The organisation of extracted data should not be seen as a constraint the system imposes on the user. AKIRA's smart-cache must be structured according to the user's understanding. The AKIRA system does not provide a rigid global schema, but small schema components that represent its extraction capabilities and can be combined to match the structure expressed by the user.

When the user sends a query, the system first computes the target structure (i.e. the schema that the user expresses in his query) and uses it as the schema for the cache. Of course, there is no magic: the system can answer a query only when the structure expressed by the user matches its extraction capabilities. However, the system allows some flexibility through a fuzzy-matching module in charge of mapping the user's vocabulary to its own.
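How the fuzzy-matching module maps vocabulary is not described here. As one possible illustration, the Java sketch below maps a user's term to the closest known schema term by edit (Levenshtein) distance. The choice of edit distance, the class name `FuzzyMatchSketch`, the threshold, and the sample terms are all assumptions, not AKIRA's actual technique.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: edit distance is a stand-in technique, chosen for illustration only.
final class FuzzyMatchSketch {
    // Classic dynamic-programming Levenshtein distance between two strings.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Returns the schema term closest to the user's term, or null when
    // nothing lies within maxDistance edits (the query cannot be matched).
    static String closest(String userTerm, List<String> schemaTerms, int maxDistance) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : schemaTerms) {
            int d = editDistance(userTerm.toLowerCase(), candidate.toLowerCase());
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return bestDistance <= maxDistance ? best : null;
    }

    public static void main(String[] args) {
        List<String> schemaTerms = Arrays.asList("name", "deadline", "location");
        System.out.println(closest("deadlin", schemaTerms, 2));
        System.out.println(closest("xyz", schemaTerms, 1));
    }
}
```

A purely lexical measure like this one would not relate synonyms such as "venue" and "location"; a real module would presumably need a thesaurus or similar resource for that.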
 
 

Possible future extensions

A Natural Language (NL) interface would allow the user to express himself with only an approximate knowledge of the available schema components, and thus to express fuzzy queries that would be translated into OQL queries with general path expressions.

Query evaluation can be optimized with query rewriting. First, the last step of AKIRA query evaluation, which takes place once the cache is populated, can be optimized at the level of the database system; this corresponds to usual query optimization based on query rewriting. AKIRA also offers several other promising avenues for optimization that should be investigated.

Tomorrow the system should be able to mimic human browsing when processing documents. To perform such a task, IE tools should first be extended to HTML (and upcoming XML) syntax, then to the hyperstructure of Web documents (as opposed to the linear structure of usual documents). Browsing is a non-deterministic process that can be seen as a loop from (1) type a URL (or click) to (2) analyse the retrieved document, and back to (1). Heuristics that mimic human behaviour should take advantage of explicit structure, such as the tags or the structure of URLs, as well as implicit structure extracted from raw data.
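The browsing loop described above can be sketched in Java as follows. To keep the example self-contained, a map from URL to outgoing links stands in for the Web, and a deterministic breadth-first order replaces the heuristics a real system would need; both choices, as well as the names `BrowsingLoopSketch` and `browse`, are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: an in-memory link table replaces real HTTP retrieval.
final class BrowsingLoopSketch {
    // Visits pages reachable from startUrl, at most maxPages of them,
    // and returns the URLs in the order in which they were analysed.
    static List<String> browse(String startUrl, Map<String, List<String>> linksByUrl, int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(startUrl);
        seen.add(startUrl);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            // Step (1): "type a URL (or click)", i.e. pick the next page to load.
            String url = frontier.poll();
            visited.add(url);
            // Step (2): analyse the retrieved document; here, only its links are read.
            List<String> links = linksByUrl.getOrDefault(url, Collections.<String>emptyList());
            for (String next : links) {
                if (seen.add(next)) {
                    frontier.add(next); // back to step (1) on a later iteration
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("a.html", Arrays.asList("b.html", "c.html"));
        web.put("b.html", Arrays.asList("c.html", "d.html"));
        web.put("c.html", Arrays.asList("a.html"));
        System.out.println(browse("a.html", web, 10));
    }
}
```

Heuristics of the kind mentioned in the text would replace the first-in, first-out choice in step (1), for instance by ranking candidate links by their anchor text or URL structure.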

The conceptual representation of extracted information is format and medium independent. AKIRA's architecture makes it easy to plug in tools that process other formats or media and extract information from documents.

Quality of service could be improved by providing information about recall and precision, as a measure of confidence, together with the extracted information.
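For reference, precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that were retrieved. The Java sketch below computes both from two sets of document identifiers. It assumes that a set of known relevant documents is available for evaluation, which the text does not state; the class name and the convention of returning 0.0 for an empty denominator are likewise choices made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the standard precision and recall measures.
final class RetrievalQualitySketch {
    // Number of documents that are both retrieved and relevant.
    private static int overlap(Set<String> retrieved, Set<String> relevant) {
        Set<String> both = new HashSet<>(retrieved);
        both.retainAll(relevant);
        return both.size();
    }

    // Precision = |retrieved AND relevant| / |retrieved|; 0.0 when nothing was retrieved.
    static double precision(Set<String> retrieved, Set<String> relevant) {
        return retrieved.isEmpty() ? 0.0 : (double) overlap(retrieved, relevant) / retrieved.size();
    }

    // Recall = |retrieved AND relevant| / |relevant|; 0.0 when nothing is relevant.
    static double recall(Set<String> retrieved, Set<String> relevant) {
        return relevant.isEmpty() ? 0.0 : (double) overlap(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> retrieved = new HashSet<>(Arrays.asList("a", "b", "c", "d"));
        Set<String> relevant = new HashSet<>(Arrays.asList("b", "c", "e"));
        System.out.println("precision = " + precision(retrieved, relevant));
        System.out.println("recall    = " + recall(retrieved, relevant));
    }
}
```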

Lastly, the user should be able to define the output format of the result. AKIRA could be combined with an HTML (or XML) restructuring tool.


AKIRA's features

Multiplatform

The AKIRA system has been fully written in Perl and Java and can run on any platform supporting these languages.

Small footprint

AKIRA's cache has been implemented using the PSE embedded database from ObjectStore and requires very little memory. The breakdown is as follows:
Component       Size
------------------------
OQL parser      ~   30 KB
AKIRA           ~   50 KB
PSE             ~  400 KB
Regular Exp     ~  100 KB
JGL (*)         ~  560 KB
------------------------
Total           ~ 1200 KB

(*) We are only using a limited set of classes.

Modular and extensible

AKIRA is modular. Services can be plugged directly into the system.
Since every component inside the system follows the same interface, the system knows how to build complex classes out of their sub-components.
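The idea of a shared interface can be illustrated with a minimal composite sketch in Java. The interface `ExtractionComponent`, the two implementing classes, and the regular expressions are hypothetical and are not AKIRA's real interface; the sketch only shows how a complex object can be assembled when every part answers the same call.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical common interface followed by every component.
interface ExtractionComponent {
    String name();
    Object extract(String document);
}

// Leaf component: returns the first match of a regular expression, or null.
final class PatternComponent implements ExtractionComponent {
    private final String name;
    private final Pattern pattern;

    PatternComponent(String name, String regex) {
        this.name = name;
        this.pattern = Pattern.compile(regex);
    }

    public String name() {
        return name;
    }

    public Object extract(String document) {
        Matcher m = pattern.matcher(document);
        return m.find() ? m.group() : null;
    }
}

// Complex component: built from sub-components that follow the same interface.
final class CompositeComponent implements ExtractionComponent {
    private final String name;
    private final List<ExtractionComponent> parts;

    CompositeComponent(String name, List<ExtractionComponent> parts) {
        this.name = name;
        this.parts = parts;
    }

    public String name() {
        return name;
    }

    // Delegates to each sub-component and collects the results by attribute name.
    public Object extract(String document) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (ExtractionComponent part : parts) {
            fields.put(part.name(), part.extract(document));
        }
        return fields;
    }
}

final class ComponentDemo {
    public static void main(String[] args) {
        ExtractionComponent year = new PatternComponent("year", "\\b(?:19|20)\\d{2}\\b");
        ExtractionComponent city = new PatternComponent("city", "Paris|London");
        ExtractionComponent conference =
            new CompositeComponent("conference", Arrays.asList(year, city));
        System.out.println(conference.extract("Workshop on Data, Paris, 1998"));
    }
}
```

Because the composite is itself an `ExtractionComponent`, composites can be nested to any depth without the caller having to know.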

 AKIRA provides a template to write new services. Most services are composed of a Perl script (used for extraction) and its corresponding Java wrapper (used for object construction and database storage). The system offers some generic tools for extraction as well as object construction.

 As an illustration, here are some pieces of the Java and Perl programs for the component "Date of Submission".

Java program
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

class SubmissionDate extends DateComponent {

    public SubmissionDate(String confName) throws Exception {
        try {
            // create the system call
            String[] cmd = buildScript(confName);
            // launch the call and process the result
            process(confName, getProcStream(cmd), getCorefTable());
        } catch (Exception e) {
            this.date = new BuggyDate();
        }
    }

    private void process(String conf, InputStream is, CoreferenceTable corefTable) throws Exception {
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        this.date = parseDate(br.readLine()); // date value
        br.readLine();                        // empty line
        String str = br.readLine();
        while (str != null) {
            Span span = extractSpan(str);     // doc start end
            corefTable.addSpan(conf, span.doc, span);
            str = br.readLine();              // date
            br.readLine();                    // empty line
            str = br.readLine();
        }
        return;
    }
}
Perl program
#!/usr/bin/perl

sub initFilter {
    print "Building filter...\n" if ($DEBUG);
    return sub {
        my $line = shift;
        my $context = " (subm .*? | paper .*? (due|rece) .*?) ";
        my $pattern1 = $context . $Date::dateFilter;
        my $pattern2 = $Date::dateFilter . $context;
        return ( ( ($line =~ /$pattern1/oxi) || ($line =~ /$pattern2/oxi) )
                 && ($line !~ /camera|abstract/oxi))
    };
}


Future Work

Future work will probably include the following topics:




Copyright AKIRA Project 1998, University of Pennsylvania
Department of Computer and Information Science, Institute for Research in Cognitive Science