Re: UNC crawling with the latest updates
5 messages in com.googlegroups.google-enterprise-developer

From          Sent On
Chris         19 Feb 2007 08:59
Sean Cooper   20 Feb 2007 04:56
Chris         20 Feb 2007 06:11
Shef2000      21 Feb 2007 17:03
Shef2000      22 Feb 2007 10:23
Subject: Re: UNC crawling with the latest updates
From: Shef2000 (ian.@gmail.com)
Date: 02/22/2007 10:23:55 AM
List: com.googlegroups.google-enterprise-developer

Google support was nice enough to provide this answer:

Thanks for mailing Google Mini Support Team.

As the browser doesn't understand the smb:// format, you got the message "protocol not supported". To avoid this issue in the browser and to view the result in HTTP format, add the following regular expression to the "Remove URLs" field (Google Mini > Serving > Front Ends > your_frontend > Remove URLs) of the particular front end you are using.

regexp:smb://[^/]+/([^/]+/)+$
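
To see which URLs this pattern actually covers, here is a minimal sketch in Python. It assumes the appliance's regular expression syntax behaves like Python's `re` module for this pattern, which the support reply does not confirm; the helper name `is_directory_listing` and the sample URLs are made up for illustration.

```python
import re

# Pattern copied from the support reply, without the "regexp:" prefix
# that marks a regular expression in the admin console field.
DIRECTORY_URL = re.compile(r"smb://[^/]+/([^/]+/)+$")

def is_directory_listing(url: str) -> bool:
    """Return True when the URL has a host, at least one path segment,
    and ends with a slash (that is, it looks like a share or folder)."""
    return DIRECTORY_URL.search(url) is not None

# Hypothetical sample URLs, not taken from a real crawl.
samples = [
    "smb://myhost.mycompany.com/myshare/",
    "smb://myhost.mycompany.com/myshare/mydir/",
    "smb://myhost.mycompany.com/myshare/mydir/mydoc.txt",
    "smb://myhost.mycompany.com/",
]
for url in samples:
    print(url, is_directory_listing(url))
```

Under that assumption, only the share and folder URLs (those ending in a slash after at least one path segment) match; the document URL and the bare host URL do not.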

Note: As when crawling web-based content (using the HTTP or HTTPS protocols), the appliance uses Uniform Resource Locators (URLs) to refer to individual objects (files, directories, shares and hosts) available on SMB-based network file systems. To uniquely identify a document, three components are required: the hostname, the share name, and the file path. A fourth URL component, the protocol, differentiates SMB from other types of URLs.

The file path specifies the path to the document, relative to the root of the share. If myshare on host myhost.mycompany.com shares all documents under the C:\myshare directory, requesting smb://myhost.mycompany.com/myshare/mydir/mydoc.txt will retrieve the document located at C:\myshare\mydir\mydoc.txt.
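
The four components described above can be pulled apart with a short Python sketch. This is only an illustration of the URL layout, not of how the appliance parses URLs; the function names and the `share_root` parameter are hypothetical, and the sample values come from the support reply.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlsplit

def split_smb_url(url: str):
    """Split an smb:// URL into (protocol, hostname, share, file_path)."""
    parts = urlsplit(url)
    if parts.scheme != "smb":
        raise ValueError("not an smb:// URL: " + url)
    # The first path segment is the share; the rest is relative to its root.
    share, _, file_path = parts.path.lstrip("/").partition("/")
    return parts.scheme, parts.hostname, share, file_path

def local_path(share_root: str, file_path: str) -> str:
    """Map the file path onto the directory the share exports (assumed known)."""
    return str(PureWindowsPath(share_root).joinpath(*file_path.split("/")))

protocol, host, share, file_path = split_smb_url(
    "smb://myhost.mycompany.com/myshare/mydir/mydoc.txt"
)
print(protocol, host, share, file_path)
print(local_path(r"C:\myshare", file_path))
```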

For more information on SMB crawling, please see the following link:

https://support.google.com/enterprise/doc/app/4x/FSCrawl.html

Hi Chris,

I just received my Mini a few days ago as well and have run into the same issue, although the documentation clearly states:

<quote> If you are using Windows UNC path names, you do not need to specify the protocol and you need to use a backslash ("\") instead of a forward slash. UNC entries would use this format:

\\<host>[:port]\<path>

The information contained in square brackets [ ] is optional. The backslash after <host>[:port] is required.

Valid examples:

https://www.example.com/secure/
http://www.example.com:80/help/
smb://fileshare.mycompany.com/
\\fileshare.mycompany.com\shared\
</quote>

It seems that a rewrite of the URL is taking place. I was successful in setting up a Samba share, though:

smb://myserver/myshare/

But then the search results included smb:// URLs, so Windows users could not browse to the actual files, only the cached versions.

I did find an alternate solution which included rewriting the results using XSLT so that the smb:// is replaced by \\, but have yet to try it.
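
The string rewrite itself is simple. Here is a rough Python sketch of the transformation, untested against the appliance; `smb_to_unc` is a made-up name, and in an XSLT 1.0 stylesheet the same idea could presumably be expressed with the standard `substring-after()` and `translate()` functions.

```python
SMB_PREFIX = "smb://"

def smb_to_unc(url: str) -> str:
    """Turn smb://host/share/path into \\\\host\\share\\path.

    URLs that do not start with smb:// are returned unchanged.
    """
    if not url.startswith(SMB_PREFIX):
        return url
    # Drop the protocol, flip every forward slash, and prepend the UNC prefix.
    return "\\\\" + url[len(SMB_PREFIX):].replace("/", "\\")

print(smb_to_unc("smb://myserver/myshare/"))
print(smb_to_unc("http://www.example.com/help/"))
```

Whether a browser will then open the resulting \\host\share link is a separate question that this sketch does not address.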

I did send Google a support request to see if there is a workaround, and will keep you posted.

Regards,
Ian

On Feb 20, 6:12 am, "Chris" <itsa@gmail.com> wrote:

Thanks for the input. I just tried it out, however, I am still receiving the error:

"You have entered one or more invalid start URLs. Please check your edits."

I've tried both with and without the port:

//myshare.company.com/myfolder/
//myshare.company.com:139/myfolder/

-Chris

On Feb 20, 7:56 am, "Sean Cooper" <seco@mitre.org> wrote:

Have you tried using forward slashes?

i.e. //myshare.company.com/myfolder/

The search appliances are running Linux, which I believe uses forward slashes for network shares.

On Feb 19, 11:59 am, "Chris" <itsa@gmail.com> wrote:

I was looking at the documentation for the Search Appliance today, and noticed that under "Crawl and Index > Crawl URLs" that we can now specify UNC paths.

However, when I attempt to specify a UNC path, such as \\myshare.company.com\myfolder\, the appliance assumes that it is a mistyped HTTP path and changes it to something like http://%5C%5Cmyshare.company.com%5Cmyfolder%5C/.
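
For what it's worth, that rewritten value is exactly what you get by percent-encoding the backslashes (%5C is the encoded form of "\") and wrapping the result as an HTTP URL. The Python sketch below only reproduces the string; it is a guess at the behavior, not a description of what the appliance does internally, and `as_mistyped_http_url` is a made-up helper.

```python
from urllib.parse import quote

def as_mistyped_http_url(entry: str) -> str:
    """Percent-encode the whole entry and treat it as an HTTP host.

    Hypothetical reconstruction of the observed rewrite, nothing more.
    """
    return "http://" + quote(entry, safe="") + "/"

unc_entry = "\\\\myshare.company.com\\myfolder\\"
print(unc_entry)
print(as_mistyped_http_url(unc_entry))
```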

Additionally, under the "Follow and Crawl..." heading, I tried to enter \\myshare.company.com\; however, it will only accept this if I end the line with a forward slash: \\myshare.company.com\/

Has anybody else tried anything with UNC paths? If so, were you able to successfully crawl?

Any input would be greatly appreciated.

Thanks,
-Chris