Tika

Tika is a Java-based text extraction toolkit aimed primarily at files whose format is not known in advance: it detects the file type, extracts metadata and structured content, and can also identify the language of the text. It is designed to work with third-party parsers, so developers can integrate Tika with parsers they already know.
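
The detect-then-parse design described above can be sketched in plain Java. This is a simplified illustration only and does not use Tika's own classes or API: the names `ExtractionSketch`, `ContentParser`, `register`, `detect` and `extract` are hypothetical, and the magic-byte check is a stand-in for whatever detection a real toolkit performs.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class ExtractionSketch {

    /** Hypothetical plug-in contract: turns raw bytes into plain text. */
    public interface ContentParser {
        String extractText(byte[] data);
    }

    // Registry mapping a detected media type to the parser that handles it.
    private final Map<String, ContentParser> parsers = new HashMap<>();

    /** Plug in a (possibly third-party) parser for one media type. */
    public void register(String mediaType, ContentParser parser) {
        parsers.put(mediaType, parser);
    }

    /** Guess the media type from the leading "magic" bytes of the data. */
    public static String detect(byte[] data) {
        if (startsWith(data, "%PDF".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(data, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        if (startsWith(data, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        return looksLikeText(data) ? "text/plain" : "application/octet-stream";
    }

    /** Detect the type, then hand the data to the parser registered for it. */
    public String extract(byte[] data) {
        String type = detect(data);
        ContentParser parser = parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.extractText(data);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Crude heuristic: reject control characters other than tab, LF, VT, FF, CR.
    private static boolean looksLikeText(byte[] data) {
        for (byte b : data) {
            int u = b & 0xFF; // bytes are signed in Java
            if (u < 0x09 || (u > 0x0D && u < 0x20)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ExtractionSketch sketch = new ExtractionSketch();
        // A trivial parser for plain text; real parsers would be far richer.
        sketch.register("text/plain",
                data -> new String(data, StandardCharsets.UTF_8).trim());

        byte[] sample = "  hello, unknown format  ".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(sample) + ": " + sketch.extract(sample));
    }
}
```

Keeping detection separate from parsing is what lets a parser for a new format be registered without changing the calling code.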

Category: 
Text mining
Availability: 
Offline
Other software required: 
Other software required
Difficulty: 
Advanced
User Community: 
Mailing list; wiki
Active Development: 
Active development
Purpose: 
Single purpose
Operating System: 
Windows
Operating System: 
Mac
Operating System: 
Unix
System Requirements: 
Java 5 or higher; Maven 2.0 may be required to build Tika.