DOCEX

 
"There are known knowns… there are known unknowns… but there are also unknown unknowns - the ones we don't know we don't know."
Donald Rumsfeld

Overview
IntuScan™ DOCEX is a fully automated, integrated, semantics-driven platform that processes text in any supported language, including multilingual and “hybrid-lingual” text. Its purpose is to extract all relevant information from large quantities of unstructured text written in a variety of languages, and to generate both a structured representation and a natural language report covering the characterization of the document, the entities identified in it, and other information implicit in it.
 
The key features of IntuScan™ DOCEX are:
  • Capability to operate on “hybrid languages” such as “Spanglish” (Hispanic English) and to identify not only the language of a text but also its “register”: the social, ethnic or educational stratum it reflects.
  • Capability to determine the relevance of an input text to the needs of the client and to assign it a category.
  • Capability to extract entities from the text according to their roles (persons, places, organizations, events, objects).
  • Capability to identify ideas and sentiments vis-à-vis entities and to aggregate those sentiments into sentiment towards the entities’ common “parent”.
  • Capability to identify, disambiguate and match names of entities and to link them to contextual information for further identification, creating a virtual “identity card” of the entity behind each name: possible affiliations, name variants, ethnic origin, gender, links, etc.
  • Capability to identify sentiment towards entities that appear in the text and towards “super-entities” that are alluded to in the text but not explicitly mentioned.

Language Identification
IntuScan™ DOCEX begins the analysis process by identifying the languages of a given document and ranking them by prominence. IntuScan™ currently identifies more than 60 languages, some of which share a script. It also distinguishes between parts of the same text written in different languages, and identifies “loan words” and phrases in “hybrid languages” (concepts from a “guest” language that are transliterated and integrated into the “host” language), restoring them to the source language so that their meaning can be extracted. For some languages, IntuScan™ DOCEX also identifies the general domain of the document (e.g. drug trafficking, chemicals, terror), allowing the subsequent phases to use fine-grained logic that matches the specific domain.
IntuScan™ DOCEX currently supports the following languages: Arabic, English, French, Indonesian, Spanish, and Urdu. The following languages will be available in the near future: German, Persian, Russian, Somali, and Turkish.
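
As an illustration of ranking languages by prominence, the following minimal Python sketch assumes the text has already been split into segments, each labelled with a language code, and ranks the languages by their share of characters. The function name, the segment format, and the character-share measure are assumptions made for this example, not the IntuScan™ DOCEX implementation.

```python
from collections import Counter

def rank_languages(segments):
    """Rank languages by their share of characters in a document.

    segments: list of (language_code, text) pairs (hypothetical input format).
    Returns a list of (language_code, share) pairs, most prominent first.
    """
    chars = Counter()
    for lang, text in segments:
        chars[lang] += len(text)
    total = sum(chars.values())
    if total == 0:
        return []
    return [(lang, count / total) for lang, count in chars.most_common()]

# A short "Spanglish" sentence split into labelled segments.
segments = [("en", "I went to the "), ("es", "tienda"), ("en", " yesterday")]
ranking = rank_languages(segments)
```

Here English is ranked first and Spanish second; a real system would also have to produce the segment labels, which is the difficult part.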

Idea Mining
IntuScan™ DOCEX analyzes the text using its integrated Natural Language Processing (NLP) engine to uncover language-independent concepts that are defined in an ontology for various domains. As opposed to existing approaches, which follow the “bag-of-words” idea, IntuScan™ DOCEX uses linguistic tools to identify complex expressions and to resolve ambiguous expressions according to their domain context. IntuView has developed in-depth ontologies for several domains relating to the fields of homeland security, finance, and law enforcement. Unlike traditional flat ontology methods, the IntuScan™ ontology is a multi-dimensional structure of relationships between unique concepts. Users can work with the provided ontologies or, using a knowledge management tool, extend and enrich them to meet their specific requirements.

Sentiment Analysis
IntuScan™ identifies the attitude of the author of a given text towards different entities, drills down into sentiment towards specific attributes or features of those entities, and aggregates sentiment towards a set of linked entities that share a common “parent” entity. For example, negative sentiment towards several entities that belong to the same parent (senior officials of a country, who all have in common their affinity to that country, or a number of products that all belong to a certain company) may indicate a general negative sentiment towards the parent entity. The sentiment analysis algorithms are based on statistical models composed of semantic features.
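
The aggregation of sentiment towards a common parent can be sketched as follows. This is a minimal illustration only: the entity names, the parent links, the score range (-1.0 to 1.0) and the use of a plain average are all assumptions, and the product's statistical models are not reproduced here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical entity-to-parent links; in practice these would come from an ontology.
PARENT_OF = {
    "Minister A": "Country X",
    "General B": "Country X",
    "Product C": "Company Y",
}

def aggregate_parent_sentiment(entity_scores):
    """Average per-entity sentiment scores (-1.0 to 1.0) by parent entity."""
    by_parent = defaultdict(list)
    for entity, score in entity_scores.items():
        parent = PARENT_OF.get(entity)
        if parent is not None:  # entities without a known parent are skipped
            by_parent[parent].append(score)
    return {parent: mean(scores) for parent, scores in by_parent.items()}

# Negative sentiment towards two officials yields negative sentiment towards their country.
scores = {"Minister A": -0.8, "General B": -0.4, "Product C": 0.5}
parent_scores = aggregate_parent_sentiment(scores)
```

A plain average is the simplest choice; weighting by how prominently each entity features in the text would be an equally plausible design.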

Named Entity Recognition and Analysis
IntuScan™ DOCEX recognizes all the named entities (e.g. persons, organizations, locations, facilities, dates, events) in the document, transliterates them into various standards, and analyzes them to determine the role of each part of an entity name. The recognition component is based on a hybrid approach that combines statistical models with rules based on linguistic and cultural knowledge (naming conventions, etc.) for identifying and classifying entities. This method extracts implicit information derived from the names (gender, ethnicity, etc.) and contextual information surrounding them (titles, types, sentiment, etc.) in order to aggregate and disambiguate entities and to find affinities between them. The extracted information is modeled according to the ontology and formatted as RDF structures. The information can then be used by IntuScan™ Name Matcher to match the extracted entities with existing lists.

Domain-based Categorization
IntuScan™ DOCEX categorizes a given document according to a predefined list of categories and values. The categories are carefully selected for each domain as part of the domain ontology. The values are assigned based on statistical calculations that aim to find the similarity of the given document to a corpus of manually annotated documents of the same category. The similarity is measured using the ideas, entities, and other information extracted from the document in previous steps.
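
To make the idea of similarity to an annotated corpus concrete, the sketch below scores a document against each category value using cosine similarity between feature counts. The feature lists, the category values and the choice of cosine similarity are assumptions made for illustration; the text above does not specify which statistical calculation the product uses.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def categorize(doc_features, annotated):
    """Rank category values by similarity to the document.

    doc_features: list of extracted features (ideas, entities, ...).
    annotated: dict mapping a category value to a list of feature lists,
               one per manually annotated document (hypothetical format).
    """
    doc_vec = Counter(doc_features)
    scored = []
    for value, examples in annotated.items():
        centroid = Counter()
        for feats in examples:
            centroid.update(feats)  # sum the feature counts of the annotated documents
        scored.append((value, cosine(doc_vec, centroid)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

annotated = {
    "finance": [["bank", "transfer", "account"], ["bank", "loan"]],
    "narcotics": [["shipment", "cartel"], ["cartel", "border", "shipment"]],
}
ranking = categorize(["bank", "transfer", "shipment"], annotated)
```

In this toy corpus the document is closest to "finance", with a lower but non-zero similarity to "narcotics".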

Inference of Implicit Ideas
IntuScan™ DOCEX uses the extracted entities, ideas, categories, and the general context in which they occur to discover additional implicit information such as sentiment, affiliation, and key themes of the analyzed document. This is performed by ontology-based rules that capture domain-specific patterns and structures. The rules can be easily modified to comply with the user’s specific requirements.

Generation of Structured and Unstructured Reports
IntuScan™ DOCEX generates a natural language report for any given document. The report characterizes the document and presents the key ideas in their appropriate context. The report is currently available in English and French. This information is also generated as an RDF structure that is stored in a triple-store semantic database for later querying. Currently IntuScan™ DOCEX is fully integrated with Oracle 11g and other database platforms.
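
As a rough illustration of the RDF output, extracted facts can be written as subject-predicate-object triples and serialized in the standard N-Triples syntax, which most triple stores can load. The namespace, predicate names and identifiers below are hypothetical and do not reflect the actual IntuScan™ ontology.

```python
BASE = "http://example.org/docex/"  # hypothetical namespace, for illustration only

def iri(local_name):
    """Wrap a local name as an absolute IRI in N-Triples syntax."""
    return f"<{BASE}{local_name}>"

def literal(text):
    """Encode a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'

def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one triple per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

triples = [
    (iri("doc/1"), iri("hasCategory"), literal("finance")),
    (iri("doc/1"), iri("mentions"), iri("entity/42")),
    (iri("entity/42"), iri("hasName"), literal('John "Jack" Doe')),
]
output = to_ntriples(triples)
```

Once such triples are loaded into a triple store, they can be queried with SPARQL.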