|
Home > Archive > IIS Index Server > May 2004 > For Hillary - Help Me
| Author |
For Hillary - Help Me
|
|
| Ashish Kanoongo 2004-05-06, 9:44 am |
| Hello
We have the following questions, each with the reasoning behind it. Please advise how we should handle these issues.
1. We would like to search PDF and MS Word documents. What options do we have for searching PDF documents?
When we present search results to users, we do not want to say "we found what you are looking for on page 45 of this document", because they would then have to click to download the PDF and manually navigate to page 45.
We essentially want to pull out the textual components of the source MS Word and/or PDF documents and make them searchable on their own.
2. We expect that the documents will need to be converted to another format for searching across documents, perhaps XML or XSLT.
The source document is an MS Word doc. Each page of the document should be treated as its own entity. It has a connection to the document as a whole, but it needs to be addressable and searchable as its own body. MS Word 2003 supports saving to XML. From our initial research, it appears that converting a document to XML and then writing an import engine to process this XML is the best way to pull out the text and images. Each page will have a header, body text, and one or more images.
3. We expect to purchase an MS SQL Server database and plan to store the documents in it, pushing the searching and persistent-storage logic out to the database. Ideally we would use the SQL backend to do the searching (SELECT queries using full-text and/or LIKE searching). This would also let us cross-reference other tables in the same database.
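To make point 2 concrete, here is a minimal Python sketch of the "import engine" idea: it reads a document saved as Word 2003 XML and splits its text at explicit page breaks. This is only an illustration under assumptions. The function name `split_pages` and the sample document are hypothetical, and the sketch only sees hard page breaks (`<w:br w:type="page"/>`); page boundaries that Word computes at layout time are not stored in the XML, so a real document may need manual breaks inserted first. Headers and images are ignored here.

```python
import xml.etree.ElementTree as ET

# Namespace used by the Word 2003 XML format (WordprocessingML 2003).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
P_TAG = f"{{{W}}}p"        # paragraph
T_TAG = f"{{{W}}}t"        # run of text
BR_TAG = f"{{{W}}}br"      # break (page break when w:type="page")
TYPE_ATTR = f"{{{W}}}type"


def split_pages(xml_text):
    """Return one text string per page, splitting on explicit page breaks."""
    root = ET.fromstring(xml_text)
    pages, current = [], []
    for el in root.iter():  # document order
        if el.tag == P_TAG:
            current.append("\n")            # new paragraph starts a new line
        elif el.tag == T_TAG:
            current.append(el.text or "")
        elif el.tag == BR_TAG and el.get(TYPE_ATTR) == "page":
            pages.append("".join(current).strip())
            current = []
    pages.append("".join(current).strip())  # text after the last break
    return pages


# Hypothetical two-page sample, just enough structure to exercise the code.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>Page one header</w:t></w:r></w:p>
    <w:p><w:r><w:t>Body text about pumps.</w:t></w:r></w:p>
    <w:p><w:r><w:br w:type="page"/></w:r></w:p>
    <w:p><w:r><w:t>Page two header</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

if __name__ == "__main__":
    for number, text in enumerate(split_pages(SAMPLE), start=1):
        print(number, repr(text))
```

Each returned string could then be stored as its own row, keyed by document and page number, so that a hit can be reported per page rather than per file.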
| |
| Hilary Cotter 2004-05-07, 6:34 am |
| 1) Download the PDF IFilter from Adobe for this. There is no way of telling the user that the hit is on page 45, however. There is a company working on this, but I am not sure how far they have gotten. From the wording of your question, it sounds like Index Server (IS) will do exactly what you want, i.e. no page indication.
2) Have a look at filtdump -b.
3) OK. SQL 2005 FTS (full-text search) offers much better performance than SQL 2000 or SQL 7 FTS.
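On point 3, the per-page storage and cross-referencing idea can be sketched as follows. This is a rough illustration, not a SQL Server implementation: Python's built-in sqlite3 module stands in for the database so the example runs anywhere, and the table names (`documents`, `pages`), the sample rows, and the helper functions are all hypothetical. On SQL Server with a full-text index, the `LIKE` filter would typically be replaced by the `CONTAINS` predicate.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one row per document page."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE documents (
            id    INTEGER PRIMARY KEY,
            title TEXT NOT NULL
        );
        CREATE TABLE pages (
            doc_id  INTEGER NOT NULL REFERENCES documents(id),
            page_no INTEGER NOT NULL,
            body    TEXT NOT NULL,
            PRIMARY KEY (doc_id, page_no)
        );
    """)
    conn.execute("INSERT INTO documents VALUES (1, 'Pump manual')")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?, ?)",
        [(1, 1, "Safety notes"), (1, 2, "Impeller replacement steps")],
    )
    return conn


def search_pages(conn, term):
    """Return (document title, page number) for pages containing term.

    The term is passed as a bound parameter; note that % and _ inside
    it would still act as LIKE wildcards.
    """
    rows = conn.execute(
        """
        SELECT d.title, p.page_no
        FROM pages AS p
        JOIN documents AS d ON d.id = p.doc_id
        WHERE p.body LIKE ?
        ORDER BY d.title, p.page_no
        """,
        (f"%{term}%",),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(search_pages(conn, "impeller"))
```

Because each page is its own row, the join gives the document title and the page number together, which is the "jump to page" information that a whole-file index cannot supply.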
| |
| Ashish Kanoongo 2004-05-07, 8:37 am |
| Hilary
Thanks for the information. Can you tell me the name and whereabouts of the company that is working on the jump-to-page logic?
Ashish
| |
| Hilary Cotter 2004-05-11, 5:42 pm |
| ba-insight.net
--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html
|