IIS Index Server - For Hillary - Help Me


Author For Hillary - Help Me
Ashish Kanoongo

2004-05-06, 9:44 am

Hello

We have the following questions, along with the reasoning behind each one. Please advise on how we should handle these issues.

1. We would like to search PDF and MS Word documents. What options do we have for searching PDF documents?
When we present the search results to users, we do not want to say "we found what you are looking for on page 45 of this document" and then make them click to download the PDF and manually navigate to page 45.

We essentially want to pull out the textual components of the source MS Word and/or PDF documents and make them searchable on their own.

2. We expect that the documents will need to be converted to another format for searching across documents - perhaps XML or XSLT.

The source document is actually an MS Word doc. Each page of the document should be treated as its own entity. It has a connection to the document as a whole, but it needs to be addressable and searchable as its own body. MS Word [2003] supports saving to XML. From our basic, initial research it appears that converting a document to XML and then writing an import engine to process this XML is the best way to pull out the text and images. Each page will have a header, body text, and 1 or more images.

3. We expect to purchase an MS SQL Server database and plan to store the documents in it, so that the searching and persistent storage logic is pushed out to the database. It would be ideal to use the power of the SQL backend to do the searching (perform SELECT queries using full-text and/or " .. LIKE .. " searching). This would have the benefit of allowing cross-referencing to other tables in the same database.
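
The import engine described in point 2 could look roughly like the sketch below. It is only an illustration under stated assumptions: it assumes the saved Word 2003 XML uses the WordprocessingML namespace shown, and that page boundaries appear as explicit `w:br w:type="page"` breaks. Page breaks that Word works out at layout time would not be seen by this code. The names `split_pages` and `SAMPLE` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for Word 2003 XML (WordprocessingML).
W = "http://schemas.microsoft.com/office/word/2003/wordml"


def split_pages(xml_text):
    """Return a list of strings, one per page, splitting on explicit page breaks."""
    root = ET.fromstring(xml_text)
    pages = [[]]  # each page is a list of paragraph texts
    for para in root.iter(f"{{{W}}}p"):
        parts = []
        for node in para.iter():
            if node.tag == f"{{{W}}}t" and node.text:
                parts.append(node.text)
            elif node.tag == f"{{{W}}}br" and node.get(f"{{{W}}}type") == "page":
                # Text before the break stays on the current page.
                if parts:
                    pages[-1].append("".join(parts))
                    parts = []
                pages.append([])
        if parts:
            pages[-1].append("".join(parts))
    return ["\n".join(page) for page in pages]


# Hypothetical two-page document used only for illustration.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}"><w:body>
<w:p><w:r><w:t>Page one header</w:t></w:r></w:p>
<w:p><w:r><w:t>Body of page one</w:t></w:r><w:r><w:br w:type="page"/></w:r></w:p>
<w:p><w:r><w:t>Page two header</w:t></w:r></w:p>
</w:body></w:wordDocument>"""

if __name__ == "__main__":
    for number, text in enumerate(split_pages(SAMPLE), start=1):
        print(f"--- page {number} ---")
        print(text)
```

Each returned string could then be stored as its own searchable row, keyed by document and page number. Images are ignored here; only the text runs are collected.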





---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.677 / Virus Database: 439 - Release Date: 05/04/2004
Hilary Cotter

2004-05-07, 6:34 am

1) Download the PDF IFilter from Adobe for this. There is no way to tell the user that the hit is on page 45, however. There is a company working on this, but I am not sure how far they have gotten. From the wording of your question, it sounds like Index Server (IS) will do exactly what you want, i.e. no page indication.

2) Have a look at filtdump -b.

3) ok, SQL 2005 FTS offers much better performance than SQL 2000 or SQL 7 FTS
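
For point 3, the idea of searching per-page text in the database and cross-referencing it to other tables can be sketched as follows. This is an illustration, not the SQL Server setup discussed in the thread: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in, and it uses only LIKE matching. On SQL Server, a full-text query would instead use a predicate such as CONTAINS against a column that has a full-text index. The table names, column names, and the functions `build_db` and `search_pages` are hypothetical.

```python
import sqlite3


def build_db():
    """Create an in-memory database with one row per document and one per page."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE documents (
            doc_id INTEGER PRIMARY KEY,
            title  TEXT NOT NULL
        );
        CREATE TABLE pages (
            doc_id  INTEGER NOT NULL REFERENCES documents(doc_id),
            page_no INTEGER NOT NULL,
            body    TEXT NOT NULL,
            PRIMARY KEY (doc_id, page_no)
        );
        """
    )
    return conn


def search_pages(conn, term):
    """Return (title, page_no) for every page whose body contains term.

    Note: '%' and '_' inside term are treated as LIKE wildcards here.
    """
    rows = conn.execute(
        "SELECT d.title, p.page_no "
        "FROM pages AS p JOIN documents AS d ON d.doc_id = p.doc_id "
        "WHERE p.body LIKE ? "
        "ORDER BY d.title, p.page_no",
        (f"%{term}%",),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_db()
    conn.execute("INSERT INTO documents VALUES (1, 'Manual')")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?, ?)",
        [(1, 1, "introduction and scope"), (1, 2, "warranty terms apply")],
    )
    print(search_pages(conn, "warranty"))
```

Because each page is its own row, a hit identifies the page directly, and the join shows how results can be tied back to the parent document or to any other table.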

Ashish Kanoongo

2004-05-07, 8:37 am

Hilary,

Thanks for the information. Can you tell me the name and whereabouts of the company that is working on the jump-to-page logic?

Ashish
Hilary Cotter

2004-05-11, 5:42 pm

ba-insight.net

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


Copyright 2003 - 2008 webservertalk.com