TIKA - Document Type Detection
MIME Standards
The Multipurpose Internet Mail Extensions (MIME) standards are the most widely used standards for identifying document types. Browsers rely on them to decide how to handle the content they receive.
Whenever a browser encounters a media file, it chooses compatible software available to it to display the contents. If it has no suitable application for a particular media file, it recommends that the user install a suitable plugin.
Type Detection in Tika
Tika supports all the Internet media document types provided in MIME. Whenever a file is passed through Tika, Tika detects its document type. To detect media types, Tika internally uses the following mechanisms.
File Extensions
Checking the file extension is the simplest and most widely used method of detecting the format of a file. Many applications and operating systems support these extensions. Shown below are the extensions of a few well-known file types.
File type | Extension |
---|---|
Image | .jpg |
Audio | .mp3 |
Java archive file | .jar |
Java class file | .class |
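As a rough illustration of extension-based detection, the sketch below maps the extensions from the table to media types. It uses only the Java standard library and is not Tika's implementation; the class name `ExtensionSketch`, the method `detectByName`, and the small lookup table are hypothetical names chosen for this example.

```java
import java.util.Locale;
import java.util.Map;

class ExtensionSketch {
    // Hypothetical lookup table; a real detector ships a far larger registry.
    static final Map<String, String> TYPES = Map.of(
        "jpg", "image/jpeg",
        "mp3", "audio/mpeg",
        "jar", "application/java-archive",
        "class", "application/java-vm");

    // Returns the media type for a file name, or a generic
    // fallback when the extension is missing or unknown.
    static String detectByName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "application/octet-stream";
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TYPES.getOrDefault(ext, "application/octet-stream");
    }

    public static void main(String[] args) {
        System.out.println(detectByName("example.mp3")); // audio/mpeg
        System.out.println(detectByName("Photo.JPG"));   // image/jpeg
        System.out.println(detectByName("README"));      // application/octet-stream
    }
}
```

The lookup never opens the file, which makes it fast but easy to fool: renaming a file changes the result.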
Content-type Hints
Whenever you retrieve a file from a database or attach it to another document, you may lose the file's name or extension. In such cases, the metadata supplied with the file is used to detect the file type.
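A content-type hint often arrives as a string such as `text/html; charset=UTF-8`. The sketch below shows one way such a hint could be reduced to a bare media type. It is a standard-library illustration under that assumption, not Tika's code, and the names `ContentTypeHintSketch` and `mediaTypeFromHint` are hypothetical.

```java
import java.util.Locale;

class ContentTypeHintSketch {
    // Strips parameters such as "charset=..." and normalizes case.
    // Falls back to a generic type when no hint is available.
    static String mediaTypeFromHint(String hint) {
        if (hint == null || hint.trim().isEmpty()) {
            return "application/octet-stream";
        }
        int semi = hint.indexOf(';');
        String base = semi >= 0 ? hint.substring(0, semi) : hint;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(mediaTypeFromHint("text/html; charset=UTF-8")); // text/html
        System.out.println(mediaTypeFromHint(null)); // application/octet-stream
    }
}
```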
Magic Byte
If you inspect the raw bytes of a file, you can often find byte patterns that are unique to its format. Some files begin with special byte prefixes called magic bytes, which are included in the file specifically to identify its type.
For example, you can find CA FE BA BE (hexadecimal) at the start of a compiled Java class file and %PDF (ASCII) at the start of a PDF file. Tika uses this information to identify the media type of a file.
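The two prefixes mentioned above can be checked with a few lines of plain Java. This is a minimal sketch of the idea rather than Tika's detector; `MagicByteSketch` and its methods are hypothetical names, and only these two signatures are covered.

```java
import java.nio.charset.StandardCharsets;

class MagicByteSketch {
    // CA FE BA BE marks a compiled Java class file.
    static final byte[] CLASS_MAGIC =
        {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
    // "%PDF" in ASCII marks a PDF file.
    static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    // True when data begins with every byte of prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Maps the leading bytes of a file to a media type.
    static String detect(byte[] head) {
        if (startsWith(head, CLASS_MAGIC)) {
            return "application/java-vm";
        }
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        byte[] classHead =
            {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        System.out.println(detect(pdfHead));   // application/pdf
        System.out.println(detect(classHead)); // application/java-vm
    }
}
```

In practice the `head` array would hold the first few bytes read from the file or stream being examined.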
Character Encodings
Plain text files can be encoded using many different character encodings. The main challenge here is to identify which encoding a file uses. Tika uses techniques such as byte order mark (BOM) detection and byte frequency analysis to identify the encoding of plain text content.
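BOM detection is the simpler of the two techniques: a few fixed bytes at the start of the file name the encoding. The sketch below checks the UTF-8 and UTF-16 marks only. It is an illustration with hypothetical names (`BomSketch`, `charsetFromBom`), not Tika's encoding detector, and it does not attempt byte frequency analysis.

```java
import java.util.Optional;

class BomSketch {
    // Returns the encoding named by a leading BOM, if one is present.
    // UTF-32 marks are not handled in this sketch.
    static Optional<String> charsetFromBom(byte[] b) {
        if (b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return Optional.of("UTF-8");
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return Optional.of("UTF-16BE");
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return Optional.of("UTF-16LE");
        }
        // No BOM: a fuller detector would fall back to statistical analysis.
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] plain = {'h', 'i'};
        System.out.println(charsetFromBom(utf8).orElse("unknown"));  // UTF-8
        System.out.println(charsetFromBom(plain).orElse("unknown")); // unknown
    }
}
```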
XML Root Characters
To detect XML documents, Tika parses them and extracts information such as the root element, namespaces, and referenced schemas, from which the true media type of the file can be determined.
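The root-element idea can be sketched with the StAX parser that ships with the JDK: read only as far as the first start tag, then map its namespace and local name to a media type. The class `XmlRootSketch`, its methods, and the two-entry mapping are assumptions made for this example and do not reflect how Tika implements the check.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class XmlRootSketch {
    // Reads up to the first start tag and returns its qualified name.
    static QName rootElement(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs when handling untrusted input.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader =
            factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getName();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    // Maps a few well-known root elements to specific media types.
    static String detect(String xml) {
        try {
            QName root = rootElement(xml);
            if (root == null) {
                return "application/octet-stream";
            }
            String ns = root.getNamespaceURI();
            String name = root.getLocalPart();
            if ("http://www.w3.org/2000/svg".equals(ns) && "svg".equals(name)) {
                return "image/svg+xml";
            }
            if ("http://www.w3.org/1999/xhtml".equals(ns) && "html".equals(name)) {
                return "application/xhtml+xml";
            }
            return "application/xml";
        } catch (XMLStreamException e) {
            // Not well-formed XML, so no XML media type applies.
            return "application/octet-stream";
        }
    }

    public static void main(String[] args) {
        System.out.println(detect("<svg xmlns=\"http://www.w3.org/2000/svg\"/>"));
        System.out.println(detect("<note><to>Tika</to></note>"));
    }
}
```

The first call prints `image/svg+xml` and the second prints `application/xml`, since an unrecognized root falls back to the generic XML type.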
Type Detection using Facade Class
The detect() method of the Tika facade class is used to detect the document type. This method accepts a file as input and returns its media type as a string. Shown below is an example program for document type detection with the Tika facade class.
    import java.io.File;

    import org.apache.tika.Tika;

    public class TypeDetection {

        public static void main(String[] args) throws Exception {
            // Assume example.mp3 is in your current directory
            File file = new File("example.mp3");

            // Instantiate the Tika facade class
            Tika tika = new Tika();

            // Detect the file type using the detect() method
            String filetype = tika.detect(file);
            System.out.println(filetype);
        }
    }
Save the above code as TypeDetection.java and run it from the command prompt using the following commands:

    javac TypeDetection.java
    java TypeDetection

The program prints the detected media type:

    audio/mpeg