Extra URLs REX

Thunderstone Search Appliance Manual

Syntax: zero or more regular expressions (REX), separated by spaces or line breaks

When a Base URL matches, restricts the walk to fetching only URLs that match at least one of the specified regular expressions. An expression may match anywhere in the URL (hostname, path, or query).

If a Base URL is matched by an Extra URLs REX, then only URLs that match the Extra URLs REX will be walked on that host. If a Base URL does not match any Extra URLs REX, it is walked as normal.

This setting is rarely used. It is most commonly combined with a hostname in order to fetch matching URLs on an additional host. Links to those pages must still be found during the walk for them to be indexed.

For example, with the following Extra URLs REX:

>>=http://products\.example\.com=!supplierid+supplierid\=BigCo

(which matches a URL that begins with http://products.example.com and contains supplierid=BigCo), and using the following Base URLs:

http://products.example.com/listProducts.aspx?supplierid=BigCo
http://help.example.com/index.aspx

The Extra URLs REX matches the products.example.com Base URL, so on that host only pages with supplierid=BigCo will be walked, while all of help.example.com will be walked (subject to the other inclusion/exclusion rules).
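The decision described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated on this page, not the appliance's implementation. The function names are hypothetical, and the Python regular expression is an approximation based on the description of the REX given in the text (Python's `re` syntax is not REX syntax). Other inclusion/exclusion settings are not modelled.

```python
import re
from urllib.parse import urlsplit

# Assumption: a Python regex approximating the example REX, i.e. "begins with
# http://products.example.com and contains supplierid=BigCo". Not REX syntax.
EXTRA_URL_PATTERNS = [
    re.compile(r"^http://products\.example\.com.*supplierid=BigCo"),
]

BASE_URLS = [
    "http://products.example.com/listProducts.aspx?supplierid=BigCo",
    "http://help.example.com/index.aspx",
]


def matches_extra(url):
    """True if the URL matches at least one Extra URLs pattern."""
    return any(p.search(url) for p in EXTRA_URL_PATTERNS)


def restricted_hosts(base_urls):
    """Hosts whose Base URL is matched by an Extra URLs pattern."""
    return {urlsplit(u).hostname for u in base_urls if matches_extra(u)}


def should_walk(url, base_urls=BASE_URLS):
    """Apply the Extra URLs rule only; other walk settings are ignored here."""
    host = urlsplit(url).hostname
    if host in restricted_hosts(base_urls):
        # Restricted host: only URLs matching a pattern are walked.
        return matches_extra(url)
    # Host's Base URL did not match: walked as normal.
    return True
```

With these assumptions, `should_walk("http://products.example.com/item.aspx?supplierid=Other")` is rejected because products.example.com is a restricted host, while any URL on help.example.com is accepted.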

Available from version 4.3.9.

See also the Extra Domains setting, and the REX syntax reference for details on REX search syntax.