| Ashish Kanoongo 2004-05-05, 2:34 am |
| Hello
We have the following questions, along with the reasons behind each one. Please advise how we should handle these issues.
1. We would like to search inside PDF and MS Word documents. What options do we have for searching PDF documents?
When we present the search results to the users, we do not want to say "we found what you are looking for on page 45 of this document", because then they have to click to download the PDF and manually navigate to page 45.
We essentially want to pull out the textual components of the source MS Word and/or PDF documents and make them searchable on their own.
2. We expect that the documents will need to be converted to another format for searching across documents, perhaps XML or XSLT.
Actually the source document is an MS Word doc. Basically, each page of the document should be treated as its own entity. Sure, it has a connection to the document as a whole but it needs to be addressed / searchable as its own body. MS Word [2003] supports saving to XML. From our basic, initial research it appears that converting a document to XML and then writing an import engine to process this XML is the best way to "pull-out" the textual / graphical images. Each page will have a header, body text, and 1 or more images.
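To illustrate the "each page is its own entity" idea, here is a minimal sketch of such an import engine in Python. It assumes a hypothetical, simplified XML layout with one `<page>` element per page; the element and attribute names (`page`, `header`, `body`, `image`, `src`) and the `Page` class are made up for illustration and are not the schema that Word 2003 actually writes, which is considerably more complex and would need its own mapping.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical, simplified export format used only for this sketch.
# Real Word 2003 XML does not look like this.
SAMPLE_XML = """
<document title="User Guide">
  <page number="1">
    <header>Introduction</header>
    <body>This guide explains the product.</body>
    <image src="logo.png"/>
  </page>
  <page number="2">
    <header>Installation</header>
    <body>Run the installer and follow the prompts.</body>
    <image src="step1.png"/>
    <image src="step2.png"/>
  </page>
</document>
"""


@dataclass
class Page:
    """One page, addressable on its own but still linked to its document."""
    document_title: str
    number: int
    header: str
    body: str
    images: list = field(default_factory=list)


def extract_pages(xml_text):
    """Turn the XML into a list of Page records, one per <page> element."""
    root = ET.fromstring(xml_text.strip())
    title = root.get("title", "")
    pages = []
    for node in root.findall("page"):
        pages.append(Page(
            document_title=title,
            number=int(node.get("number")),
            header=(node.findtext("header") or "").strip(),
            body=(node.findtext("body") or "").strip(),
            images=[img.get("src") for img in node.findall("image")],
        ))
    return pages


pages = extract_pages(SAMPLE_XML)
for page in pages:
    print(page.number, page.header, len(page.images))
```

Each `Page` record keeps the document title so a search hit can point back to the whole document while still being displayed on its own.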
3. We expect to purchase an MS SQL database and plan to store the documents in it, so that the searching and persistent storage logic is pushed out to the database. It would be ideal to use the power of the SQL backend to do the searching (SELECT queries using full-text and/or " .. LIKE .. " searching). This would also let us cross-reference other tables in the same database.
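As a rough sketch of the page-level " .. LIKE .. " search with a cross-reference to another table, the example below uses Python's built-in `sqlite3` module purely as a stand-in so that it runs without a server. The table and column names (`documents`, `pages`, `page_number`, and so on) and the `search_pages` helper are assumptions for illustration. On MS SQL the same SELECT with LIKE should carry over, while full-text searching would use that product's own predicates and indexes instead.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real SQL backend.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    document_id INTEGER REFERENCES documents(id),
    page_number INTEGER,
    header TEXT,
    body TEXT
);
""")

# Sample rows: one document, stored page by page.
conn.execute("INSERT INTO documents (id, title) VALUES (1, 'User Guide')")
conn.executemany(
    "INSERT INTO pages (document_id, page_number, header, body) "
    "VALUES (?, ?, ?, ?)",
    [
        (1, 1, "Introduction", "This guide explains the product."),
        (1, 2, "Installation", "Run the installer and follow the prompts."),
    ],
)


def search_pages(conn, term):
    """Return (title, page_number, header, body) for pages matching term."""
    pattern = f"%{term}%"
    return conn.execute(
        """SELECT d.title, p.page_number, p.header, p.body
           FROM pages p JOIN documents d ON d.id = p.document_id
           WHERE p.body LIKE ? OR p.header LIKE ?
           ORDER BY d.title, p.page_number""",
        (pattern, pattern),
    ).fetchall()


results = search_pages(conn, "installer")
for title, page_number, header, body in results:
    print(f"{title}, page {page_number}: {header}: {body}")
```

Because each hit is a row in `pages`, the result can show the page text directly instead of sending the user to page N of a downloaded file. The search term is passed as a bound parameter rather than concatenated into the SQL string.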
|