05-05-04 07:34 AM
Hello
We have the following questions, along with the reasoning behind each one.
Please advise how we should handle these issues.
1. We would like to search within PDF and MS Word documents. What options do
we have for searching PDF documents?
When we present the search results to users, we do not want to say "we found
what you are looking for on page 45 of this document" and then make them
click to download the PDF and manually navigate to page 45.
We essentially want to pull out the textual components of the source MS Word
and/or PDF documents and make them searchable on their own.
2. We expect that the documents will need to be converted to another format
for searching across documents, perhaps XML or XSLT.
The source document is an MS Word document. Each page of the document should
be treated as its own entity: it has a connection to the document as a
whole, but it needs to be addressable and searchable on its own. MS Word
2003 supports saving to XML. From our initial research, it appears that
converting a document to XML and then writing an import engine to process
this XML is the best way to pull out the text and images. Each page will
have a header, body text, and one or more images.
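
To make the "import engine" idea concrete, here is a minimal sketch in Python
of splitting a Word 2003 XML file into per-page text. It assumes the document
uses the Word 2003 WordprocessingML namespace and that pages are separated by
explicit page breaks (<w:br w:type="page"/>); page breaks that Word inserts
automatically during layout are not handled here. The function name
split_pages and the sample document are hypothetical, and headers and images
are left out.

```python
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 XML (WordprocessingML) documents.
W = "http://schemas.microsoft.com/office/word/2003/wordml"


def split_pages(wordml: str) -> list[str]:
    """Return the text of each page, splitting on explicit page breaks only."""
    root = ET.fromstring(wordml)
    pages = [[]]  # one list of paragraph strings per page
    for para in root.iter(f"{{{W}}}p"):
        parts = []  # text collected for the current paragraph
        for node in para.iter():
            if node.tag == f"{{{W}}}t" and node.text:
                parts.append(node.text)
            elif node.tag == f"{{{W}}}br" and node.get(f"{{{W}}}type") == "page":
                # Explicit page break: close the current page, start a new one.
                if parts:
                    pages[-1].append("".join(parts))
                    parts = []
                pages.append([])
        if parts:
            pages[-1].append("".join(parts))
    return ["\n".join(p) for p in pages]


# Hypothetical two-page document, for illustration only.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>Page one header</w:t></w:r></w:p>
    <w:p><w:r><w:t>Body text.</w:t></w:r><w:r><w:br w:type="page"/></w:r></w:p>
    <w:p><w:r><w:t>Page two header</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

pages = split_pages(SAMPLE)
for number, text in enumerate(pages, start=1):
    print(f"--- page {number} ---")
    print(text)
```

Each returned string could then be stored as its own searchable record, keyed
by the source document and page number.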
3. We expect to purchase an MS SQL database and plan to store the documents
in it, pushing the searching and persistent storage logic out to the
database. Ideally we would use the power of the SQL backend to do the
searching (performing SELECT queries using full-text and/or "... LIKE ..."
searching). This would also let us cross-reference other tables in the same
database.
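
As a rough illustration of page-level storage with database-side searching,
the sketch below uses Python's built-in sqlite3 module as a stand-in for MS
SQL Server. The table names (documents, pages), the columns, and the sample
rows are all assumptions made for the example, not a recommended schema. It
shows a LIKE search over page bodies joined back to the parent document, so a
result can point straight at one page.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the real backend.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE pages (
        document_id INTEGER REFERENCES documents(id),
        page_no     INTEGER,
        header      TEXT,
        body        TEXT,
        PRIMARY KEY (document_id, page_no)
    );
""")

# Hypothetical sample data: one document with two pages.
conn.execute("INSERT INTO documents VALUES (1, 'Installation Guide')")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?, ?)",
    [
        (1, 1, "Overview", "This guide covers setup."),
        (1, 2, "Wiring", "Connect the pump to the controller."),
    ],
)


def search(term):
    """Return (title, page_no, header) for pages whose body contains term."""
    return conn.execute(
        """
        SELECT d.title, p.page_no, p.header
        FROM pages AS p
        JOIN documents AS d ON d.id = p.document_id
        WHERE p.body LIKE ?
        ORDER BY d.title, p.page_no
        """,
        (f"%{term}%",),  # note: % and _ inside term act as wildcards
    ).fetchall()


hits = search("pump")
print(hits)
```

On SQL Server, once a full-text index exists on the body column, the LIKE
predicate could be swapped for a CONTAINS predicate, which avoids scanning
every row for each search.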