Java team briefing about Apache Tika

Ethan Millar

This post was written by a Java development support team for developers who want to learn about Apache Tika, a portable parser.

Introduction

Apache Tika is a Java-based, portable parser framework. To work with it, the development environment needs a Java runtime: early Tika releases required Java 1.5 or higher, and newer releases need a more recent version, so check the requirements of the release you download.

What is a parser?

A parser is a data-extraction component that takes a file as input and produces the file's data in a different format. Many parsers are available. For example, an XML parser reads XML data and can output it as text. Apache Tika is a portable content parser: it accepts a variety of input files and produces output in one consistent format. Most other parsers accept only one type of data; an XML parser, for example, always takes XML as input. Tika accepts many file formats, such as MS Office documents, XML, PDF, and JAR.
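To make the idea of a single-format parser concrete, here is a minimal sketch that turns XML into plain text using only the XML parser that ships with the JDK. The class name XmlTextExtractor and the method extractText are made up for this illustration and are not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper for illustration only; not a Tika class.
final class XmlTextExtractor {

    // Parses an XML string and returns the concatenated text of all elements.
    static String extractText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers get one simple failure type.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

This extractor only understands XML. Handing it a PDF or an Office document fails, which is the limitation a multi-format parser such as Tika is meant to remove.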

How does Apache Tika work?

Tika is built around two main interfaces:

  • Parser
  • Detector

Parser & Detector

Parser is an interface that hides the complexity of the system from the client application. A client can extract content from files of any supported format, such as XML, PPT, and XLS, without worrying about which API is needed for each format. The client application simply passes the file to the Parser interface.

The Detector's responsibility is to identify the format of the input file and pass the file to the correct parser for data extraction.

[Diagram: high-level Tika system]

The following topics are covered in this article:
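One common way to identify a file format is to look at the first few bytes of the file, often called magic bytes. The sketch below shows that idea with plain JDK classes. MagicByteDetector and its detect method are hypothetical names chosen for this example, and the three signatures are only a small sample; Tika's own detection is considerably more thorough.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of signature-based detection; not Tika's Detector implementation.
final class MagicByteDetector {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // ZIP local file header "PK\3\4"; JAR files are ZIP archives and share it.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    private static final byte[] XML_MAGIC = "<?xml".getBytes(StandardCharsets.US_ASCII);

    // Returns a media type for the leading bytes of a file, or a generic fallback.
    static String detect(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) return "application/pdf";
        if (startsWith(head, ZIP_MAGIC)) return "application/zip";
        if (startsWith(head, XML_MAGIC)) return "application/xml";
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

The returned media type is what a dispatcher would use to pick the matching parser. Note that this toy version cannot tell a JAR from any other ZIP file, because both start with the same signature.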

  • Language detection for a document
  • Metadata about the document
  • Reading various types of document content

Download Apache Tika from the official website (latest version at the time of writing: tika-app-1.13.jar).

https://tika.apache.org/download.html

Language detection example

Tika provides a powerful feature called language detection. To find out the language of a particular document, use Tika's language identifier API.

import org.apache.tika.language.LanguageIdentifier;

// Identify the language of the given text and print its language code
LanguageIdentifier identifier = new LanguageIdentifier("What are you doing");
System.out.println(identifier.getLanguage());

The program prints “en”.
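To give a feel for how a language can be guessed from text at all, here is a deliberately simple sketch that counts common words per language. This is a swapped-in toy technique, not the algorithm Tika uses; the class StopwordLanguageGuesser, the guess method, and the tiny word lists are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy language guesser for illustration; not how Tika's LanguageIdentifier works.
final class StopwordLanguageGuesser {
    private static final Map<String, Set<String>> COMMON_WORDS = new LinkedHashMap<>();
    static {
        COMMON_WORDS.put("en", new HashSet<>(Arrays.asList("the", "are", "you", "what", "is", "and")));
        COMMON_WORDS.put("fr", new HashSet<>(Arrays.asList("le", "la", "est", "que", "vous", "et")));
        COMMON_WORDS.put("de", new HashSet<>(Arrays.asList("der", "die", "das", "ist", "und", "sie", "was")));
    }

    // Returns the language code whose word list matches most words, or "unknown".
    static String guess(String text) {
        String[] words = text.toLowerCase(Locale.ROOT).split("\\W+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : COMMON_WORDS.entrySet()) {
            int hits = 0;
            for (String word : words) {
                if (entry.getValue().contains(word)) hits++;
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

For the sentence used above, "What are you doing", three words match the English list and none match the others, so the guess is "en". A real identifier needs far larger language profiles to be reliable on short or mixed text.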

File type detector example

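import java.io.File;
import org.apache.tika.Tika;

try {
    File file = new File("tika-app-1.13.jar");
    Tika tika = new Tika();
    // detect() returns the media (MIME) type of the file
    String filetype = tika.detect(file);
    System.out.println("The file type is " + filetype);
} catch (Exception ex) {
    ex.printStackTrace();
}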

Output:

The file type is application/java-archive

The code above determines the file format. The Tika class is a façade class that internally invokes the appropriate API for the file. A façade hides the system's complexity from the client: the client only has to pass in the file, and the façade calls the correct API for the given file type.

[Diagram: Tika façade class system]

The client application knows only about the Tika façade class; all the complexity sits behind it. That is why it is called a façade class.
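The façade idea itself can be shown in a few lines of plain Java. In the sketch below, MiniFacade is a hypothetical class: it hides a crude type check and two text extractors behind one method, so the caller never chooses an extractor directly. The method is named parseToString only to echo the Tika façade, and the internals are invented for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical façade for illustration; the internals are not taken from Tika.
final class MiniFacade {
    // Hidden subsystem: one extractor per media type.
    private final Map<String, Function<String, String>> extractors = new HashMap<>();

    MiniFacade() {
        // Strip tags, then collapse runs of whitespace into single spaces.
        extractors.put("application/xml",
                s -> s.replaceAll("<[^>]*>", " ").trim().replaceAll("\\s+", " "));
        extractors.put("text/plain", String::trim);
    }

    // Hidden subsystem: a crude type check based on the first visible character.
    private String detect(String content) {
        return content.trim().startsWith("<") ? "application/xml" : "text/plain";
    }

    // The only method a client needs to know about.
    String parseToString(String content) {
        return extractors.get(detect(content)).apply(content);
    }
}
```

A client just calls new MiniFacade().parseToString(input). Adding support for another format means registering one more extractor inside the façade, with no change to client code.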

Metadata Detector

Metadata is information about a particular object or entity. For example, a JPG file carries information such as file size and resolution. This type of information is called metadata.


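import java.io.*;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.*;
import org.apache.tika.sax.BodyContentHandler;

try {
    File file = new File("golden-words.jpg");
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    FileInputStream inputstream = new FileInputStream(file);
    ParseContext context = new ParseContext();
    parser.parse(inputstream, handler, metadata, context);
    System.out.println(handler.toString());
    String[] metadataNames = metadata.names();
    // Assumed completion of the truncated loop: print each name and its value
    for (int i = 0; i < metadataNames.length; i++) {
        System.out.println(metadataNames[i] + ": " + metadata.get(metadataNames[i]));
    }
} catch (Exception ex) {
    ex.printStackTrace();
}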

Ethan Millar

I have been a technical writer at Aegis Softtech for the last nine years. I mainly write articles about Java, support, services, and the latest updates.