Community Post

Java team briefing about Apache Tika

Ethan Millar

This post is written by a leading Java development support team for developers who want to learn about Apache Tika, a portable parser.

Introduction

Apache Tika is a Java-based portable parser, and the framework itself is written in Java. To work with Apache Tika, your development environment needs a JDK. Early releases ran on Java 1.5; newer releases require a more recent version, so check the requirements of the release you download.

What is a parser?

A parser is a data-extraction component that takes a file as input and produces the file's data in a different format. There are many parsers on the market. For example, an XML parser reads XML data and exposes it as text. Most parsers accept only one type of input: an XML parser always takes XML data. Apache Tika is a portable content parser. It accepts a variety of input formats, such as MS Office documents, XML, PDF, and JAR files, and returns its output in one uniform format.

How does Apache Tika work?

Tika is built around two main interfaces:

  • Parser
  • Detector

Parser & Detector

Parser is an interface that hides the complexity of the system from the client application. Clients can extract content from files of any supported format, such as XML, PPT, and XLS. They don't have to worry about which API is needed for which file format. The client application simply passes the file to the Parser interface.

The Detector's responsibility is to identify the format of the input file so that the input can be handed to the correct parser for data extraction.

[Diagram: high-level Tika system]
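The cooperation between detection and parsing can be sketched in plain Java. This is not Tika code: DetectAndDispatchSketch, SimpleParser, detect and extract are hypothetical names used only for illustration, and the three signature checks are a simplified stand-in for Tika's real detection logic.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these types belong to the Tika API.
class DetectAndDispatchSketch {

    /** Stand-in for a format-specific parser: turns raw bytes into text. */
    interface SimpleParser {
        String parse(byte[] data);
    }

    // One parser per MIME type; the client never touches this table.
    private static final Map<String, SimpleParser> PARSERS = new HashMap<>();
    static {
        // Toy XML parser: strips every tag and keeps the text in between
        PARSERS.put("application/xml",
                data -> new String(data, StandardCharsets.UTF_8)
                        .replaceAll("<[^>]+>", "").trim());
        // Placeholder ZIP parser: only reports the container size
        PARSERS.put("application/zip",
                data -> "ZIP container, " + data.length + " bytes");
    }

    /** Stand-in for a detector: guesses a MIME type from the leading bytes. */
    static String detect(byte[] data) {
        if (startsWith(data, new byte[] {'%', 'P', 'D', 'F'})) {
            return "application/pdf";
        }
        if (startsWith(data, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip"; // JAR and OOXML files are ZIP containers
        }
        if (startsWith(data, new byte[] {'<', '?', 'x', 'm', 'l'})) {
            return "application/xml";
        }
        return "application/octet-stream"; // no known signature matched
    }

    /** Single entry point: detect the type, then delegate to its parser. */
    static String extract(byte[] data) {
        String type = detect(data);
        SimpleParser parser = PARSERS.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.parse(data);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version=\"1.0\"?><note>hello</note>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(xml) + " -> " + extract(xml));
    }
}
```

The caller only ever uses extract; which parser runs is decided by the detected type, which is the same division of labour the Parser and Detector interfaces describe.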

The following topics are covered in this article:

  • Language detection for a document
  • Metadata about the document
  • Reading the content of various document types

Download Apache Tika from the official website below (latest version at the time of writing: tika-app-1.13.jar).

https://tika.apache.org/download.html

Language detection example

Tika provides a powerful feature called language detection. If you want to find out the language of a particular document, Tika's language identifier API does it for you.

import org.apache.tika.language.LanguageIdentifier;

// Identify the language of a short text sample
LanguageIdentifier identifier = new LanguageIdentifier("What are you doing");
System.out.println(identifier.getLanguage());

The program prints "en".

File type detector example

import java.io.File;
import org.apache.tika.Tika;

try {
    File file = new File("tika-app-1.13.jar");
    Tika tika = new Tika();
    // detect() returns the MIME type of the file
    String filetype = tika.detect(file);
    System.out.println("The file type is " + filetype);
} catch (Exception ex) {
    ex.printStackTrace();
}

Output:

The file type is application/java-archive

The code above finds out the type of a file. The Tika class is a façade class: it internally invokes the appropriate API for detection and extraction. A façade hides the system's complexity from the client, so the client only has to pass the file and the façade calls the correct API for the given file type.

[Diagram: Tika façade class system]

The client application knows only about the Tika façade class, and all the complexity stays behind it. That is why it is called the Tika façade class.

Metadata Detector

Metadata is information about a particular object or entity. For example, a JPG file carries information such as its file size and resolution. This kind of information is called metadata.

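import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

File file = new File("golden-words.jpg");
// try-with-resources closes the stream even if parsing fails
try (InputStream inputstream = new FileInputStream(file)) {
    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    parser.parse(inputstream, handler, metadata, context);
    System.out.println(handler.toString());
    // Print every metadata name together with its value
    for (String metaDataName : metadata.names()) {
        System.out.println(metaDataName + " : " + metadata.get(metaDataName));
    }
} catch (Exception ex) {
    ex.printStackTrace();
}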

The sample above fetches the metadata of a JPG file. It uses the following four types from the Tika API, which appear in almost every kind of file extraction.

Let's take a look at each of them:

  1. Parser
  2. BodyContentHandler
  3. Metadata
  4. ParseContext

Parser

Parser is the interface through which files are parsed. Its AutoDetectParser implementation, used above, automatically detects the file type and assigns the correct parser for data extraction.

BodyContentHandler

This class receives the file's data. Tika parsers report the extracted content as XHTML through SAX events, and BodyContentHandler collects the text found inside the XHTML body.
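To make the SAX idea concrete, here is a minimal sketch that uses only the JDK's own SAX parser, not Tika. BodyTextSketch, BodyTextHandler and bodyText are hypothetical names, and the handler is a simplified imitation of what a body-collecting handler does, not BodyContentHandler's actual implementation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch built on the JDK SAX API; it does not use Tika.
class BodyTextSketch {

    /** Collects the character data that appears inside the body element. */
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inBody;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("body".equals(qName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text outside <body> (for example the title) is ignored
            if (inBody) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Feeds an XHTML string through SAX and returns the body text. */
    static String bodyText(String xhtml) {
        BodyTextHandler handler = new BodyTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                    handler);
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XHTML", ex);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Report</title></head>"
                + "<body><p>Hello, </p><p>Tika</p></body></html>";
        System.out.println(bodyText(xhtml)); // prints "Hello, Tika"
    }
}
```

Because the handler only reacts to events, the same handler works no matter which format-specific parser produced the XHTML.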

Metadata

This class holds the details of a particular file. The client application creates an instance and passes it as a parameter to the parse method. Tika reads the file details and populates the Metadata object with them. File details include the content type, the author, the creation date, the application name (for example, Microsoft Excel), and so on.

ParseContext

This class is used to customize the way a file is extracted. If you don't want to customize anything, just create an instance and pass it as a parameter to the parse method.

Reading an MS Office spreadsheet with Tika

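import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;

try (InputStream input = new FileInputStream(new File("employee.xlsx"))) {
    BodyContentHandler bodyContentHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    // Office Open XML (OOXML) parser
    OOXMLParser msofficeParser = new OOXMLParser();
    msofficeParser.parse(input, bodyContentHandler, metadata, parseContext);
    System.out.println("Spreadsheet Data " + bodyContentHandler.toString());
} catch (Exception ex) {
    ex.printStackTrace();
}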

The program above reads the data from a spreadsheet document. It does not detect the file type automatically, so you need to know which parser class reads spreadsheet data. If you use the AutoDetectParser class instead, there is no need to know the parser's name. You can call the parse method directly, as in the code below.

// AutoDetectParser picks the right parser for the detected file type
Parser parser = new AutoDetectParser();
parser.parse(input, bodyContentHandler, metadata, parseContext);

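Reading an MP4 file

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.mp3.LyricsHandler;
import org.apache.tika.parser.mp4.MP4Parser;
import org.apache.tika.sax.BodyContentHandler;

try (InputStream input = new FileInputStream(new File("song.mp4"))) {
    BodyContentHandler bodyContentHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    // MP4 parser
    MP4Parser mp4Parser = new MP4Parser();
    mp4Parser.parse(input, bodyContentHandler, metadata, parseContext);
    LyricsHandler lyrics = new LyricsHandler(input, bodyContentHandler);
    // 'if', not 'while': hasLyrics() never changes, so a loop would not end
    if (lyrics.hasLyrics()) {
        System.out.println(lyrics.toString());
    }
    System.out.println("MP4 Data " + bodyContentHandler.toString());
} catch (Exception ex) {
    ex.printStackTrace();
}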

The MP4Parser class extracts the data of an MP4 file. If you want to fetch the metadata as well, use the following code.

String[] metaDataNames = metadata.names();
for (int i = 0; i < metaDataNames.length; i++) {
    System.out.println(metaDataNames[i] + " = " + metadata.get(metaDataNames[i]));
}

If you want to use the AutoDetectParser class to read the MP4 file, replace the parser lines as shown below.

// Replace the MP4Parser declaration and its parse call in the program above with:
Parser mp4Parser = new AutoDetectParser();
mp4Parser.parse(input, bodyContentHandler, metadata, parseContext);

You can use the extraction code above for all types of files. Apache Tika supports many parsers. Some of them are listed below.

PDFParser

This is for extracting data from a PDF file.

OpenDocumentParser

This is for extracting data from an ODF file.

TXTParser

This is for reading a text file.

HtmlParser

This is for reading an HTML file.

XMLParser

This is for reading XML data.

ClassParser

This is for reading data from a Java class file.

PackageParser

This is for reading data from a JAR file.

JpegParser

This is for reading a JPG file.

Mp3Parser

This is for reading the MP3 file format.

EpubParser

This is for reading the Electronic Publication (EPUB) format, which is used for digital books.

RTFParser

This is for reading rich text documents.

FLVParser

This is for reading a Flash video file.

AudioParser

This is for reading an audio file.

All supported formats are listed at the following link:

https://tika.apache.org/1.13/formats.html

Conclusion

If you only need a parser for one particular file type, you don't have to use Apache Tika. But if your project needs to read many types of files, you should use Apache Tika instead of integrating many separate parsers into your project. Tika uses very little memory, so it is a good choice when you need to fetch data from multiple types of files.

For any queries, you can contact the Java development support team. Comments are open for this post, so feel free to share your feedback with other readers.

Ethan Millar


I have been a technical writer at Aegis Softtech for the last nine years. I mainly write articles about Java, support, services, and the latest updates.