TIKA - Architecture
Application-Level Architecture of Tika
Application programmers can easily integrate Tika into their applications. Tika provides a command line interface and a GUI to make it user friendly.
In this chapter, we will discuss the four important modules that constitute the Tika architecture −
Language detection mechanism.
MIME detection mechanism.
Parser interface.
Tika Facade class.
Language Detection Mechanism
Whenever a text document is passed to Tika, it detects the language in which the document was written. It accepts documents without any language annotation and adds the detected language to the metadata of the document.
To support language identification, Tika has a class called LanguageIdentifier in the package org.apache.tika.language, along with a language identification repository that contains the algorithms for detecting the language of a given text. Internally, Tika uses an N-gram algorithm for language detection.
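The N-gram idea mentioned above can be illustrated with a small, self-contained sketch. This is not Tika's LanguageIdentifier: the class name NGramLanguageSketch, the two tiny demo profiles, and the use of cosine similarity for scoring are all assumptions made for illustration. The sketch counts character trigrams in the input and picks the language profile whose trigram counts are most similar.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Toy trigram-based language guesser. Hypothetical sketch, not Tika's code.
final class NGramLanguageSketch {

    // Counts overlapping character trigrams; runs of non-letters become "_".
    static Map<String, Integer> trigrams(String text) {
        String s = "_" + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two trigram count vectors (assumed scoring rule).
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the key of the most similar profile, or "unknown" if nothing matches.
    static String detect(String text, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> sample = trigrams(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = similarity(sample, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    // Two tiny demo profiles; a real profile would be built from a large corpus.
    static Map<String, Map<String, Integer>> demoProfiles() {
        Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();
        profiles.put("en", trigrams("the quick brown fox jumps over the lazy dog "
                + "and that is what we have seen with the other one"));
        profiles.put("fr", trigrams("le renard brun saute par dessus le chien paresseux "
                + "et nous avons vu la maison dans les bois avec une autre"));
        return profiles;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> profiles = demoProfiles();
        System.out.println(detect("that is what we have with the dog", profiles));
        System.out.println(detect("nous avons vu le chien dans la maison", profiles));
    }
}
```

With profiles this small the guess is only reliable for text that shares words with the sample sentences. The point is the mechanism: the text is reduced to N-gram counts and compared against per-language profiles.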
MIME Detection Mechanism
Tika can detect the document type according to the MIME standards. Default MIME type detection in Tika is done using org.apache.tika.mime.MimeTypes. It uses the org.apache.tika.detect.Detector interface for most of the content type detection.
Internally, Tika uses several techniques such as file globs, content-type hints, magic bytes, and character encodings.
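Two of the techniques listed above, magic bytes and file globs, can be sketched in a few lines. The class MimeDetectionSketch, its small lookup tables, and the rule that magic bytes take priority over the file name are assumptions made for this illustration. Tika's own detectors work from a much larger media type registry.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of magic-byte and file-name detection; not Tika's detector.
final class MimeDetectionSketch {

    // Leading bytes that identify a format regardless of the file name.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    // File-name suffixes, standing in for glob patterns such as "*.txt".
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        MAGIC.put("application/zip", new byte[] {'P', 'K', 3, 4});
        GLOBS.put(".txt", "text/plain");
        GLOBS.put(".html", "text/html");
        GLOBS.put(".pdf", "application/pdf");
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Tries magic bytes first, then the file name, then a generic fallback.
    static String detect(byte[] head, String fileName) {
        for (Map.Entry<String, byte[]> e : MAGIC.entrySet()) {
            if (startsWith(head, e.getValue())) {
                return e.getKey();
            }
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : GLOBS.entrySet()) {
                if (lower.endsWith(e.getKey())) {
                    return e.getValue();
                }
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        // The content says PDF even though the name says .txt.
        System.out.println(detect(pdfHead, "report.txt"));
        System.out.println(detect(new byte[0], "notes.txt"));
    }
}
```

Checking the content before the name means a mislabelled file is still classified by what it actually contains, which is why several techniques are combined.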
Parser Interface
The Parser interface in the org.apache.tika.parser package is the key interface for parsing documents in Tika. This interface extracts the text and the metadata from a document, and gives external users who want to write parser plugins a single contract to implement.
Using different concrete parser classes, each specific to an individual document type, Tika supports a large number of document formats. These format-specific classes provide support for the different formats either by implementing the parser logic directly or by using external parser libraries.
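The idea of one interface with format-specific implementations can be sketched as follows. The names SketchParser, PlainTextSketchParser, HtmlSketchParser, and ParserSketchDemo are hypothetical, and the simplified parse method only loosely mirrors Tika's real Parser.parse(InputStream, ContentHandler, Metadata, ParseContext). The HTML handling uses regular expressions purely to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demo driver for the hypothetical parser sketch below.
final class ParserSketchDemo {

    // Runs a parser over an in-memory document and returns the extracted text.
    static String extract(SketchParser parser, String document, Map<String, String> metadata) {
        StringBuilder text = new StringBuilder();
        try (InputStream in = new ByteArrayInputStream(document.getBytes(StandardCharsets.UTF_8))) {
            parser.parse(in, text, metadata);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        String html = "<html><head><title>Hi</title></head><body><p>Hello <b>Tika</b></p></body></html>";
        System.out.println(extract(new HtmlSketchParser(), html, metadata));
        System.out.println(metadata);
    }
}

// Simplified stand-in for a parser interface: text and metadata are output arguments.
interface SketchParser {
    String supportedType();

    void parse(InputStream in, StringBuilder text, Map<String, String> metadata) throws IOException;
}

// Implements the parsing logic directly.
final class PlainTextSketchParser implements SketchParser {
    public String supportedType() {
        return "text/plain";
    }

    public void parse(InputStream in, StringBuilder text, Map<String, String> metadata) throws IOException {
        text.append(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        metadata.put("Content-Type", supportedType());
    }
}

// Crude HTML handling: pulls out the title, drops the head, strips the tags.
final class HtmlSketchParser implements SketchParser {
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public String supportedType() {
        return "text/html";
    }

    public void parse(InputStream in, StringBuilder text, Map<String, String> metadata) throws IOException {
        String html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        Matcher m = TITLE.matcher(html);
        if (m.find()) {
            metadata.put("title", m.group(1).trim());
        }
        metadata.put("Content-Type", supportedType());
        text.append(html.replaceAll("(?is)<head>.*?</head>", " ")
                .replaceAll("<[^>]+>", " ")
                .replaceAll("\\s+", " ")
                .trim());
    }
}
```

Because callers depend only on the interface, a new format can be supported by adding another implementation without changing the calling code.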
Tika Facade Class
Using the Tika facade class is the simplest and most direct way of calling Tika from Java, and it follows the facade design pattern. You can find the Tika facade class in the org.apache.tika package of the Tika API.
By implementing the basic use cases, the facade acts as a broker between your application and the rest of the library. It abstracts the underlying complexity of the Tika library, such as the MIME detection mechanism, the parser interface, and the language detection mechanism, and provides users a simple interface to use.
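The facade design pattern itself can be shown with a minimal sketch. FacadeSketch and its internals are hypothetical and far simpler than the real org.apache.tika.Tika class. The method names detect and parseToString are borrowed from that class, but the signatures here are simplified assumptions that work on a file name and an in-memory string.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical facade: one small class hiding detection and parser selection.
final class FacadeSketch {

    // Subsystem 1 (assumed): media types looked up by file-name suffix.
    private final Map<String, String> typesBySuffix = new HashMap<>();
    // Subsystem 2 (assumed): one text extractor per media type.
    private final Map<String, Function<String, String>> parsers = new HashMap<>();

    FacadeSketch() {
        typesBySuffix.put(".txt", "text/plain");
        typesBySuffix.put(".html", "text/html");
        parsers.put("text/plain", content -> content);
        parsers.put("text/html", content ->
                content.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim());
    }

    // Facade method 1: callers ask for a type without knowing how it is found.
    String detect(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> e : typesBySuffix.entrySet()) {
            if (lower.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return "application/octet-stream";
    }

    // Facade method 2: detect the type, select a parser, and return plain text.
    String parseToString(String fileName, String content) {
        String type = detect(fileName);
        Function<String, String> parser = parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.apply(content);
    }

    public static void main(String[] args) {
        FacadeSketch facade = new FacadeSketch();
        System.out.println(facade.detect("page.html"));
        System.out.println(facade.parseToString("page.html", "<p>Hello <i>world</i></p>"));
    }
}
```

The caller makes one call and never touches the detection table or the parser map, which is the simplification a facade is meant to offer.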
Features of Tika
Unified parser interface − Tika encapsulates all the third-party parser libraries within a single parser interface. Due to this feature, the user is spared the burden of selecting a suitable parser library and using it according to the file type encountered.
Low memory usage − Tika consumes few memory resources, so it is easily embeddable in Java applications. We can also use Tika within applications that run on platforms with limited resources, like mobile PDAs.
Fast processing − Applications can expect quick content detection and extraction.
Flexible metadata − Tika understands all the metadata models which are used to describe files.
Parser integration − Tika can use the various parser libraries available for each document type within a single application.
MIME type detection − Tika can detect and extract content from all the media types included in the MIME standards.
Language detection − Tika includes a language identification feature, so it can be used to sort documents by language on multilingual websites.
Functionalities of Tika
Tika supports various functionalities −
Document type detection
Content extraction
Metadata extraction
Language detection
Document Type Detection
Tika uses various detection techniques and detects the type of the document given to it.
Content Extraction
Tika has a parser library that can parse various document formats and extract their content. After detecting the type of the document, it selects the appropriate parser from the parser repository and passes the document to it. Different classes of Tika have methods to parse different document formats.
Metadata Extraction
Along with the content, Tika extracts the metadata of the document, following the same procedure as in content extraction. For some document types, Tika has dedicated classes to extract metadata.
Language Detection
Internally, Tika follows algorithms like n-gram to detect the language of the content in a given document. Tika depends on classes like LanguageIdentifier and Profiler for language identification.