semantics

Processing documents with a semantic parser enables you to achieve two goals:

  • Extract semantic information that can be used for later tasks (such as the creation of entries for faceted search);

  • Classify the document in some way (e.g. is it a “job vacancy” or an “invoice”?).

The classification task can be difficult because there might not be sufficient contextual evidence to satisfy a rigorous test.  If it is imperative that every document accepted by the classifier is correct (i.e. all are “true positives”), the test must be strict, and some documents will fail it even though they could still be important or useful.  Equally, in order to ensure that all possible positives are found, the test must be relaxed, and you will allow some documents to be classified incorrectly (i.e. “false positives”).
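The trade-off described above can be made concrete with a small sketch.  The scores and labels below are invented purely for illustration; they assume a classifier that gives each document a confidence score between 0 and 1, which the text does not specify.

```python
# Hypothetical confidence scores for eight documents, with their true labels
# (1 = genuinely a job vacancy, 0 = something else). Illustrative data only.
scores = [0.95, 0.90, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 1, 0]


def precision_recall(scores, labels, threshold):
    """Accept every document scoring at or above the threshold, then measure
    how many accepted documents were right (precision) and how many of the
    genuine positives were found (recall)."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# A strict test: everything accepted is a true positive, but two genuine
# vacancies are missed.
strict = precision_recall(scores, labels, 0.75)

# A lenient test: every genuine vacancy is found, but two false positives
# are let through.
lenient = precision_recall(scores, labels, 0.25)

print("strict :", strict)
print("lenient:", lenient)
```

With this data the strict threshold accepts only true positives but finds three of the five vacancies, whereas the lenient threshold finds all five at the cost of two false positives.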

With semantically enabled systems, you have more information to help make that decision.  This is because the semantic processing itself creates semantic meta-data.  So, rather than classifying with a document-term matrix (i.e. a matrix of word frequencies), as is the standard approach, we can use the more sophisticated semantic information about a document to classify it. 
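As a rough illustration of the difference, the sketch below builds both representations for one short document.  The tag names, the tag vocabulary and the parser output are all hypothetical; a real semantic parser will produce its own meta-data in its own format.

```python
from collections import Counter

text = "senior engineer wanted salary negotiable apply by friday"

# Standard approach: one row of a document-term matrix, i.e. word frequencies.
term_counts = Counter(text.split())

# Semantic approach: assume (hypothetically) that the parser has tagged
# spans of the document with semantic meta-data like this.
semantic_tags = [
    ("JobTitle", "senior engineer"),
    ("Salary", "negotiable"),
    ("Deadline", "friday"),
]

# A fixed, hypothetical vocabulary of tags gives every document a vector
# of the same length, whatever words it happens to contain.
TAG_VOCABULARY = ["JobTitle", "Salary", "Deadline", "InvoiceNumber", "TotalAmount"]
tag_counts = Counter(tag for tag, _ in semantic_tags)
feature_vector = [tag_counts[tag] for tag in TAG_VOCABULARY]

print(dict(term_counts))
print(feature_vector)
```

The word-frequency row grows with the vocabulary of the whole collection, while the semantic vector stays as short as the list of tags the parser can emit.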

There are several artificial intelligence techniques we can adopt to resolve this problem.  For instance, whilst it is almost impossible to specify by hand which semantic meta-tags, and which combinations of them, indicate a “true positive”, it is possible to train Artificial Neural Networks to learn which documents are “true positives” based on their semantic data.

An (artificial) neural network is a computational model loosely inspired by how the human brain functions, and such networks have been applied to a wealth of classification tasks.  They are constructed from layers of artificial neurons with weighted connections between them, modelling the brain’s neurons and synapses.  In supervised learning, as is used for this problem, a series of labelled training examples is presented to the network, which gradually adjusts its connections to learn the trends underlying the mapping from semantic data to a document classification.  A new document can then be presented to the trained network, and the network will provide an output vector stating whether or not that document is a “true positive”.
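The training process described above can be sketched in a few lines of plain Python.  This is a minimal illustration under stated assumptions, not the system's actual implementation: the tag vocabulary, the training set, the network size (one hidden layer of three neurons), the learning rate and the number of passes are all invented for the example, and the output is a single score rather than a full vector.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical tag vocabulary; each input marks which tags the parser found.
TAGS = ["JobTitle", "Salary", "InvoiceNumber", "TotalAmount"]

# Hypothetical labelled examples: (tag vector, 1 = job vacancy, 0 = not).
TRAINING = [
    ([1, 1, 0, 0], 1), ([1, 0, 0, 0], 1), ([0, 1, 0, 0], 1), ([1, 1, 0, 1], 1),
    ([0, 0, 1, 1], 0), ([0, 0, 1, 0], 0), ([0, 0, 0, 1], 0), ([0, 1, 1, 1], 0),
]

N_IN, N_HIDDEN, RATE, EPOCHS = len(TAGS), 3, 0.1, 2000

# Connection weights ("synapses") start small and random; biases start at zero.
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_HIDDEN)]
b_hidden = [0.0] * N_HIDDEN
w_out = [random.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)]
b_out = 0.0


def forward(x):
    """Return the hidden-layer activations and the output score for input x."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return hidden, 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes to 0..1


# Supervised learning: present each labelled example repeatedly and nudge
# every weight against the error it contributed to (backpropagation).
for _ in range(EPOCHS):
    for x, y in TRAINING:
        hidden, out = forward(x)
        err_out = out - y  # cross-entropy gradient at the output neuron
        for j in range(N_HIDDEN):
            # Error passed back to hidden neuron j, using the weight
            # as it was before this update.
            err_hidden = err_out * w_out[j] * (1.0 - hidden[j] ** 2)
            w_out[j] -= RATE * err_out * hidden[j]
            for i in range(N_IN):
                w_hidden[j][i] -= RATE * err_hidden * x[i]
            b_hidden[j] -= RATE * err_hidden
        b_out -= RATE * err_out


def classify(x):
    """Score a document's tag vector; values above 0.5 suggest a vacancy."""
    return forward(x)[1]


# A new, unseen document that mentions a job title and a total amount.
new_document = [1, 0, 0, 1]
score = classify(new_document)
print("score for new document:", round(score, 3))
```

In practice a library implementation would normally be used instead of hand-written training code, and the score would be compared against a threshold chosen to balance missed documents against false positives.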