
Java team briefing about Apache Tika

This post was written by a Java development support team for developers who want to learn about Apache Tika, a portable content parser.


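Apache Tika is a Java-based, portable content parser, and the framework itself is written in Java. To work with Tika you need a Java development environment. Early Tika releases ran on Java 1.5 or higher; newer releases, including the 1.13 version used in this post, require a more recent Java version, so check the requirements of the release you download.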


    What is a parser?

    A parser is a data extraction component: it takes a file as input and produces the file's data in a different format. Many parsers are available. For example, an XML parser reads XML data and can output it as text. Most parsers accept only one type of input: an XML parser always expects XML. Apache Tika is a portable content parser that accepts a wide variety of file formats, such as MS Office documents, XML, PDF, and Jar files, and produces output in one common format.
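
    To make the idea of a single-format parser concrete, here is a minimal sketch that uses only the XML parser bundled with the JDK, not Tika. The class name XmlTextExtractor and its method are hypothetical names chosen for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of Tika: a parser that accepts XML input only.
class XmlTextExtractor {

    // Parses the XML string and returns the text of all elements, concatenated.
    static String extractText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Anything that is not well-formed XML ends up here.
            throw new IllegalStateException("Input is not valid XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<employee><name>Asha</name><role>Developer</role></employee>";
        System.out.println(extractText(xml));
    }
}
```

    A parser like this fails on anything that is not XML, which is the limitation a multi-format tool such as Tika is meant to remove.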

    How does Apache Tika work?

    The following are its two main interfaces:

    • Parser
    • Detector

    Parser & Detector

    Parser is an interface that hides the complexity of the system from the client application. A client can extract content from files of any supported format, such as XML, PPT, and XLS, without worrying about which API is needed for each format. The client application simply passes the file to the Parser interface.

    The Detector's responsibility is to identify the format of the input file so that it can be handed to the correct parser for data extraction. The diagram below gives a high-level view of the Tika system:

    [Diagram: high-level view of the Tika system]
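
    The detection step can be sketched in plain Java. The example below is not Tika's Detector; it is a hypothetical, much simplified class that guesses a MIME type from the first bytes of the data. Real detectors combine more signals than this, and the short "PK" check is a simplification of the ZIP signature.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Tika's Detector: guesses a MIME type from leading bytes.
class MagicByteDetector {

    static String detect(byte[] data) {
        if (startsWith(data, "%PDF-")) return "application/pdf";
        if (startsWith(data, "PK")) return "application/zip";   // ZIP-based: jar, docx, xlsx
        if (startsWith(data, "<?xml")) return "application/xml";
        return "application/octet-stream";                        // fallback for unknown data
    }

    // True when the data begins with the ASCII bytes of the given prefix.
    private static boolean startsWith(byte[] data, String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.US_ASCII);
        if (data.length < p.length) return false;
        for (int i = 0; i < p.length; i++) {
            if (data[i] != p[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        byte[] xml = "<?xml version=\"1.0\"?><a/>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf));
        System.out.println(detect(xml));
    }
}
```

    Once the type is known, a dispatcher can look up the parser registered for that type and hand the data over.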

    The following topics are covered in this article:

    • Language detection for a document
    • Metadata about a document
    • Reading the content of various document types

    Download Apache Tika from the official Apache Tika website (latest version at the time of writing: tika-app-1.13.jar).

    Language detection example

    Tika provides a powerful feature called language detection. To find out which language a document is written in, use Tika’s language identifier API.

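        // Requires: import org.apache.tika.language.LanguageIdentifier;
        LanguageIdentifier identifier = new LanguageIdentifier("What are you doing");
        // getLanguage() returns the language code of the best match
        System.out.println(identifier.getLanguage());

    The program prints “en”.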

    File type detector example

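        // Requires: import java.io.File; import org.apache.tika.Tika;
        try {
            File file = new File("tika-app-1.13.jar");
            Tika tika = new Tika();
            // detect(File) returns the MIME type as a String
            String filetype = tika.detect(file);
            System.out.println("The file type is " + filetype);
        } catch (Exception ex) {
            ex.printStackTrace();
        }

    Output:

        The file type is application/java-archive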

    The code above finds the format of a file. The Tika class is a façade: it hides the complexity of the system from the client and internally invokes the appropriate API for the given file type. The client only has to pass in the file.

    [Diagram: Tika façade class system]

    The client application knows only about the Tika façade class; all the complexity stays behind it. That is why it is called a façade class.
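
    The façade idea itself can be shown without Tika. The sketch below is a hypothetical class written for this post: ExtractorFacade, its two toy parsers, and its extension-based detection are all assumptions made for illustration. Only the method names detect and parseToString mirror methods that the real Tika class also offers.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the façade idea; this is not how Tika is implemented.
class ExtractorFacade {

    // One toy "parser" per type, hidden behind the façade.
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    ExtractorFacade() {
        parsers.put("text/plain", data -> new String(data, StandardCharsets.UTF_8));
        parsers.put("text/csv",
            data -> new String(data, StandardCharsets.UTF_8).replace(',', ' '));
    }

    // Step 1: detect the type (by file extension only, to keep the sketch short).
    String detect(String fileName) {
        if (fileName.endsWith(".csv")) return "text/csv";
        if (fileName.endsWith(".txt")) return "text/plain";
        return "application/octet-stream";
    }

    // Step 2: pick the matching parser and run it; the caller never sees this choice.
    String parseToString(String fileName, byte[] data) {
        Function<byte[], String> parser = parsers.get(detect(fileName));
        if (parser == null) {
            throw new IllegalArgumentException("No parser for " + fileName);
        }
        return parser.apply(data);
    }

    public static void main(String[] args) {
        ExtractorFacade facade = new ExtractorFacade();
        byte[] csv = "id,name".getBytes(StandardCharsets.UTF_8);
        System.out.println(facade.detect("staff.csv"));
        System.out.println(facade.parseToString("staff.csv", csv));
    }
}
```

    The caller works with one class and two methods, and new formats can be added by registering another parser inside the façade without changing client code.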

    Metadata detector example

    Metadata is information about a particular object or entity. For example, a JPG file carries information such as its file size and resolution. This kind of information is called metadata.

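        // Imports: java.io.File, java.io.FileInputStream, org.apache.tika.metadata.Metadata,
        // org.apache.tika.parser.AutoDetectParser, org.apache.tika.parser.ParseContext,
        // org.apache.tika.parser.Parser, org.apache.tika.sax.BodyContentHandler
        File file = new File("golden-words.jpg");
        try (FileInputStream inputstream = new FileInputStream(file)) {
            Parser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            parser.parse(inputstream, handler, metadata, context);
            // Print every metadata name together with its value
            String[] metadataNames = metadata.names();
            for (int i = 0; i < metadataNames.length; i++) {
                String metaDataName = metadataNames[i];
                System.out.println(metaDataName + " : " + metadata.get(metaDataName));
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }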

    The sample code above fetches the metadata of a file, in this case a JPG file. It uses the following four classes from the Tika API, which are commonly needed for all kinds of file extraction.

    Let’s take a look at these classes:

    1. Parser
    2. BodyContentHandler
    3. Metadata
    4. ParseContext


    Parser

    Parser is the interface through which files are parsed. The AutoDetectParser implementation used above automatically detects the file type and delegates to the correct parser for data extraction.


    BodyContentHandler

    Tika parsers report the file content as XHTML through SAX events. BodyContentHandler is a SAX content handler that collects the text inside the XHTML body, which you can then read with toString().


    Metadata

    This class holds the details of a particular file. The client application creates the Metadata object and passes it as a parameter to the parse method; Tika then populates it with the file details, such as the content type, author, creation date, and application name (for example, Microsoft Excel).


    ParseContext

    This class is used to customize how a file is parsed. If you don’t need any customization, just create a ParseContext object and pass it to the parse method.
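
    The role of the content handler is easier to see in a small example. The sketch below does not use Tika: BodyTextHandler is a hypothetical class built on the JDK's SAX parser that collects the text found inside the body element of an XHTML string. It is only an approximation of the idea behind BodyContentHandler.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the idea behind BodyContentHandler, using only the JDK.
class BodyTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than 0 while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || qName.equals("body")) bodyDepth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) bodyDepth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep character data only while inside the body element.
        if (bodyDepth > 0) text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Feeds an XHTML string through the JDK SAX parser and returns the body text.
    static String bodyText(String xhtml) {
        try {
            BodyTextHandler handler = new BodyTextHandler();
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Staff</title></head>"
                     + "<body><p>Asha</p><p>Developer</p></body></html>";
        System.out.println(bodyText(xhtml));
    }
}
```

    The title text is skipped because it sits in the head element, while the paragraph text from the body is kept. This is the same separation of body text from the rest of the document that the handler in the Tika examples relies on.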

    Reading an MS Office spreadsheet with Tika

        // OOXMLParser lives in org.apache.tika.parser.microsoft.ooxml
        try (FileInputStream input = new FileInputStream(new File("employee.xlsx"))) {
            BodyContentHandler bodyContentHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            // Office Open XML (OOXML) parser
            OOXMLParser msofficeParser = new OOXMLParser();
            msofficeParser.parse(input, bodyContentHandler, metadata, parseContext);
            System.out.println("Spreadsheet Data " + bodyContentHandler.toString());
        } catch (Exception ex) {
            ex.printStackTrace();
        }

    The program above reads the data from a spreadsheet document. It does not detect the file type automatically: you have to know which parser class reads spreadsheet data. With the AutoDetectParser class you don’t need to know the parser class name; you call the parse method directly, as in the code below.

        // AutoDetectParser chooses the right parser for the input
        Parser parser = new AutoDetectParser();
        parser.parse(input, bodyContentHandler, metadata, parseContext);

    Reading an MP4 file

        // MP4Parser lives in org.apache.tika.parser.mp4
        try (FileInputStream input = new FileInputStream(new File("song.mp4"))) {
            BodyContentHandler bodyContentHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            // MP4 parser
            MP4Parser mp4Parser = new MP4Parser();
            mp4Parser.parse(input, bodyContentHandler, metadata, parseContext);
            System.out.println("MP4 Data " + bodyContentHandler.toString());
        } catch (Exception ex) {
            ex.printStackTrace();
        }

    The MP4Parser class extracts the data of an MP4 file. To print the metadata as well, add the following code inside the try block, after the parse call.

        String[] metaDataNames = metadata.names();
        for (int i = 0; i < metaDataNames.length; i++) {
            System.out.println(metaDataNames[i] + " = " + metadata.get(metaDataNames[i]));
        }

    To read the MP4 file with the AutoDetectParser class instead, replace the MP4Parser declaration in the program above with the code below.

        // Replaces: MP4Parser mp4Parser = new MP4Parser();
        Parser mp4Parser = new AutoDetectParser();
        mp4Parser.parse(input, bodyContentHandler, metadata, parseContext);

    You can use the extraction code above for all types of files. Apache Tika supports many parsers. Some of the formats they cover are listed below.

    • PDF files
    • ODF (OpenDocument Format) files
    • Text files
    • HTML files
    • XML data
    • Java class files
    • Jar files
    • JPG files
    • MP3 files
    • EPUB (Electronic Publication) files, used for digital books
    • Rich text documents
    • Flash files
    • Audio files

    The full list of supported formats is available on the Apache Tika website.


    If you only need to parse one particular file type, you don’t have to use Apache Tika. But if your project has to extract data from many types of files, use Apache Tika instead of integrating many separate parsers into your project. Tika also uses very little memory, which makes it a good choice when you need to extract data from multiple file types.

    If you have any questions, you can contact the Java development support team. Comments are open for this post, so feel free to share your feedback with other readers.

    Ethan Millar

    I have been a technical writer at Aegis Softtech for the last nine years. I mainly write articles about Java, support, services, and the latest updates.