Components
SWSE comprises the following components:
Crawling
The crawler retrieves documents from both the intranet and the Internet, stores them locally and cleanses them. We use a pipelined crawling architecture that syntactically transforms data from a variety of source formats (e.g., HTML, XML) into RDF for easy integration into a Semantic Web system.
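The pipelined transformation described above can be illustrated with a minimal sketch. It is not SWSE's crawler: it assumes documents are already fetched and held in memory as (URL, HTML) pairs, it only extracts the page title and hyperlinks, and the stage names (cleanse, to_triples, run_pipeline) and the linksTo predicate are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

DC_TITLE = "http://purl.org/dc/elements/1.1/title"
# Hypothetical predicate for hyperlinks, chosen for illustration only.
LINKS_TO = "http://example.org/vocab#linksTo"


class _TitleAndLinks(HTMLParser):
    """Collects the <title> text and all <a href> targets of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def cleanse(docs):
    """Stage 1: drop documents with no content."""
    for url, html in docs:
        if html.strip():
            yield url, html


def to_triples(docs):
    """Stage 2: turn each HTML page into (subject, predicate, object) triples."""
    for url, html in docs:
        parser = _TitleAndLinks()
        parser.feed(html)
        if parser.title.strip():
            yield (url, DC_TITLE, parser.title.strip())
        for href in parser.links:
            yield (url, LINKS_TO, href)


def run_pipeline(docs):
    """Chain the stages; each stage consumes the previous one lazily."""
    return list(to_triples(cleanse(docs)))
```

Because each stage is a generator, documents flow through the pipeline one at a time, and further stages (for example an XML transformer) could be chained in the same way.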
Object Consolidation
On the Semantic Web, URIs are used to uniquely identify entities. On the web at large, however, an entity may have no URI at all, or different sources may use different URIs for the same entity. We can improve the linkage of the data graph by resolving equivalent entities. For example, we can merge entities that represent the same person when they share the same value for an identifying property such as an email address or a social security number.
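As a rough illustration of merging entities that share an identifying value, the sketch below uses a union-find structure over triples held in memory. It is an assumption-laden example rather than the SWSE implementation: the function name consolidate, the choice of the lexicographically smallest identifier as the canonical one, and the decision to rewrite only subjects are all illustrative.

```python
def consolidate(triples, identifying_properties):
    """Merge subjects that share a value for any identifying property.

    triples: iterable of (subject, predicate, object) tuples.
    identifying_properties: predicates whose values identify an entity,
    e.g. an email property (an assumption made for this sketch).
    Returns a set of triples with subjects rewritten to a canonical id.
    """
    triples = list(triples)
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller id as the canonical one.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    first_seen = {}  # (property, value) -> first subject seen with it
    for s, p, o in triples:
        find(s)
        if p in identifying_properties:
            key = (p, o)
            if key in first_seen:
                union(first_seen[key], s)
            else:
                first_seen[key] = s

    # Only subjects are rewritten here; a fuller version would also
    # rewrite objects that refer to merged entities.
    return {(find(s), p, o) for s, p, o in triples}
```

Two descriptions of a person that share an email value end up under a single identifier, so their properties can be read together from the merged graph.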
Indexing
A system dealing with vast amounts of information must perform fast lookups to keep query response times low. The indexing component produces sorted, blocked and compressed files for our read-optimised index structure. The indexes can be distributed across multiple machines to provide good scale-up properties.
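The idea of a sorted, blocked and compressed read-only index can be sketched as follows. This is a simplified in-memory illustration, not the on-disk format used by SWSE: the class name BlockedIndex, the use of zlib, newline-separated string keys (which must not themselves contain newlines) and the block size are all assumptions.

```python
import bisect
import zlib


class BlockedIndex:
    """Sorted keys, split into fixed-size blocks, each block compressed.

    Only the first key of every block is kept uncompressed, acting as a
    sparse index that tells a lookup which block to decompress.
    """

    def __init__(self, keys, block_size=4):
        ordered = sorted(set(keys))
        self.first_keys = []
        self.blocks = []
        for i in range(0, len(ordered), block_size):
            chunk = ordered[i:i + block_size]
            self.first_keys.append(chunk[0])
            self.blocks.append(zlib.compress("\n".join(chunk).encode("utf-8")))

    def _load(self, block_no):
        return zlib.decompress(self.blocks[block_no]).decode("utf-8").split("\n")

    def prefix_lookup(self, prefix):
        """Return all keys starting with prefix, in sorted order."""
        # Start at the last block whose first key is <= prefix.
        start = max(bisect.bisect_right(self.first_keys, prefix) - 1, 0)
        results = []
        for block_no in range(start, len(self.blocks)):
            for key in self._load(block_no):
                if key.startswith(prefix):
                    results.append(key)
                elif key > prefix:
                    return results  # past the matching range
        return results
```

A lookup decompresses only the blocks that can contain matches, which is what makes the structure suitable for read-heavy workloads; updates would require rebuilding the affected blocks.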
Query Processing
The query processor component creates and optimises the logical plan for answering both interactive browsing and structured queries. The component is able to execute the plans over the network in a parallel multi-threaded fashion, accessing the interfaces provided by the local indexing components resident on the network.
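A minimal sketch of parallel execution over several index partitions follows. It stands in for the networked setup described above with plain in-memory lists, and the names lookup and execute, as well as the triple-pattern representation with None as a wildcard, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup(partition, pattern):
    """Match one triple pattern against one partition.

    pattern is (subject, predicate, object); None acts as a wildcard.
    """
    s, p, o = pattern
    return [
        t for t in partition
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def execute(partitions, pattern, max_workers=4):
    """Send the pattern to every partition in parallel and merge results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_results = pool.map(lambda part: lookup(part, pattern), partitions)
        return sorted(t for part in partial_results for t in part)
```

In a distributed deployment each call to lookup would be a remote request to the index on another machine, where threads help because most of the time is spent waiting on the network.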
Ranking
To score the importance and relevance of results during interactive exploration, we use a link analysis technique that simultaneously derives ranks for entities and data sources. Ranking is an important addition to search and query interfaces and is used to prioritise the presentation of more pertinent results.
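To give a feel for link-based ranking, the sketch below runs a generic PageRank-style power iteration over an entity graph and then sums entity ranks per data source. This is a swapped-in, simplified technique and not necessarily the algorithm SWSE uses; in particular, deriving source ranks by summation is an assumption made for illustration, as are the function names and the damping value.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Generic PageRank by power iteration.

    links: dict mapping each node to a list of nodes it links to.
    Returns a dict mapping node -> rank; ranks sum to 1.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


def source_ranks(entity_rank, entity_source):
    """Assumed aggregation: a source's rank is the sum of its entities' ranks."""
    totals = {}
    for entity, source in entity_source.items():
        totals[source] = totals.get(source, 0.0) + entity_rank.get(entity, 0.0)
    return totals
```

Entities that are referenced by many well-ranked entities score higher, and sources that describe such entities inherit that weight, which can then be used to order results.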
User Interface
To provide user-friendly search, query and browsing over the indexed data, we provide a user interface which is the human access point to the Semantic Web Search Engine. Users incrementally build queries to browse the data graph, following paths of entity relationships, and retrieve information about entities.