On Fri, 23 Jul 2004 02:25:06 -0500, James wrote:
> Hmm - so it all comes down to work flow within the company but with a solid
> set of tools - and then relying on staff discipline to work with them - I
> like the idea of using apache as the underlying technology along with
> squid, I must do some reading on this to see what modules are available to
> enhance the doc sharing.
Well, you can use squid, but I didn't. Apache can work as a proxy server
itself, i.e. serve out its own pages, and also work as a proxy for other
web servers. Just enable the proxy module (mod_proxy) in the apache
httpd.conf file.
Worked OK for me at a time when I was going out from my LAN to the
internet through a gateway machine (my own) and a dialup to my ISP. Squid
has been designed to be a proxy server. Never used it. Considered it, but
haven't done enough research to figure out which is better:
- apache w. its own proxy module
- apache & squid
If you do the research, would you let us know which is better, and why?
I suspect that squid might (should?) be better at managing a big cache?
> Any idea whether rsync or rsync technology can be integrated into squid or
> should I be looking for another caching proxy ?
Well, rsync is something different, not related to proxy. The rsync
protocol is particularly efficient at transferring contents, because it
checks for changes (even within files, I believe) by some kind of block
checksum scheme. It does take a bit of CPU. Slow machines don't like it.
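To make the block checksum idea concrete, here is a minimal sketch in Python. It is NOT rsync's actual algorithm: real rsync uses a rolling checksum so it can match blocks at any offset, while this simplified version only compares blocks at fixed offsets. The function names and the tiny block size are made up for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose so the example is easy to follow


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one MD5 hex digest per fixed-size block of data."""
    return [
        hashlib.md5(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return indexes of blocks in `new` that differ from, or are missing in, `old`."""
    old_sums = block_checksums(old, block_size)
    new_sums = block_checksums(new, block_size)
    return [
        i for i, digest in enumerate(new_sums)
        if i >= len(old_sums) or old_sums[i] != digest
    ]


old = b"aaaabbbbccccdddd"
new = b"aaaabbXbccccddddeeee"
# Only block 1 (edited) and block 4 (appended) would need to be transferred.
print(changed_blocks(old, new))  # prints [1, 4]
```

Only the checksums and the changed blocks have to cross the link, which is where the bandwidth saving comes from. Computing the checksums on both ends is the CPU cost mentioned above.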
If you have the two web servers at the 2 sites, call them
www.aaa.site.com and www.bbb.site.com, then when you use a proxy, you
always refer to documents by the site that "owns" the web content, but you
set up your browser to direct the request to the local proxy server. So,
at site aaa, you set up your browsers to use www.aaa.site.com as their
proxy, and refer to http://www.bbb.site.com/path/document to get a
document from the other side. Well, you know how a proxy works. If you use
rsync to transfer copies of the files, then you are NOT filling up the
proxy cache, but are creating duplicates on your site server. Then you
would refer to them as http://www.aaa.site.com/path/document (the copy).
The proxy cache has its own special structure to make lookups more
efficient. If you duplicate documents with rsync then you have to manage
the 2 copies and deal with merging them if you have edited both of them,
etc. Why duplicate?
You might decide to leave the documents "owned" by the 2 sites at the web
servers on the 2 sites, and avoid duplication (except in the proxy
caches). To "preload" your cache for the next day's workload, you could
run a script that simply browses those web pages overnight. Just
referencing them through the proxy server will force the server to get the
copy from the other side and put it into cache. If your cache is big
enough the documents will stay there all day (week? month?). Any random
reference to a document will of course take the full/slow download.
One idea might be to have a huge cache, and use (off the top of my head)
some tool like "wget -r" (recursive), "throwing away" the retrieved
contents, just to "populate the cache". As documents are edited on either
side, cache contents will start to go stale. Might need periodic refresh.
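As a rough illustration of the overnight "preload" idea, here is a small Python sketch that requests a list of URLs through the local proxy and throws the bodies away. The proxy address and port in the usage comment are assumptions for illustration, as are the function names. Whether a document actually lands in the cache still depends on the proxy's configuration and on the response headers from the owning server.

```python
import urllib.request
from typing import Callable, Iterable


def make_proxy_fetch(proxy_url: str, timeout: float = 30.0) -> Callable[[str], int]:
    """Build a fetch(url) function that sends HTTP requests via proxy_url."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url})
    )

    def fetch(url: str) -> int:
        # Read the whole body so the proxy retrieves the complete document,
        # then discard it; only the byte count is returned.
        with opener.open(url, timeout=timeout) as response:
            return len(response.read())

    return fetch


def warm_cache(urls: Iterable[str], fetch: Callable[[str], int]) -> dict[str, str]:
    """Request every URL once and report per-URL success or failure."""
    results = {}
    for url in urls:
        try:
            results[url] = f"ok ({fetch(url)} bytes)"
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            results[url] = f"failed ({exc})"
    return results


# Example use at site aaa (hypothetical proxy port 8080), e.g. from a nightly cron job:
#   fetch = make_proxy_fetch("http://www.aaa.site.com:8080")
#   print(warm_cache(["http://www.bbb.site.com/path/document"], fetch))
```

Passing the fetch function in as a parameter keeps the loop easy to try out without touching the network. Running it again periodically would also serve as the refresh for stale entries, assuming the proxy revalidates them.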
Without knowing more about your application it's hard to recommend whether
to use (or not use) rsync and/or duplication of content. It might be that
a proxy web server could be enough for your purposes. Think through your
access patterns. Are they predictable? How many "random" requests? Can you
wait for them to get fetched? Of course, once you have fetched a "random"
request, it will be local, living in proxy cache. You can figure it out.
As for webdav, I think it should do a "write through" to put any modified
content back on the "owning" web server (not the local proxy). This will
take time and use up bandwidth (and interfere with other communications)
but it is probably the "right thing to do" to manage the originals.
--
Juhan Leemet
Logicognosis, Inc.