Distributed Database

 
 
Alan Connor
Guest
Posts: n/a

 
      12-15-2003, 11:00 PM

Let's say that I have a database distributed among 20 boxes, all at different
locations with different IPs.

The full database takes only 5 of these boxes, so there are 3 mirrors.

Can someone tell me the best way to set things up so that a client can
just enter a URL and access whichever group of 5 boxes currently isn't
being used to capacity? How to connect those 5 boxes together so that
they are effectively ONE box?

(wget will be used to send a search string to the db and retrieve the results)

The information in the database is just text, and creating a user application
to make the system more efficient is possible.

A few keywords to search would be enough for now. I just don't know
where to start.


Thanks a lot.


AC

 
 
 
 
 
Jem Berkes
Guest
Posts: n/a

 
      12-15-2003, 11:22 PM
> Let's say that I have a database distributed among 20 boxes, all at
> different locations with different IPs.
>
> The full database takes only 5 of these boxes, so there are 3 mirrors.
>
> Can someone tell me the best way to set things up so that a client can
> just enter a URL and access whichever group of 5 boxes currently
> isn't being used to capacity? How to connect those 5 boxes together
> so that they are effectively ONE box?


To answer your first question, using DNS you can assign multiple IP
addresses to a single host name. For instance, if IPs A, B, C, D, E are
address records for db.example.com, then accessing http://db.example.com
will send traffic to one of those 5 IP addresses in a load-balanced
fashion. Typically the IP you get is random (I think?)

As for connecting the 5 independent hosts together so that they are
effectively one, maybe rsync is the way to go to synchronize the
filesystems.
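
A minimal sketch of the multiple-address-record idea above, seen from the client's side. It assumes Python and the standard `socket.getaddrinfo` call; `db.example.com` is just the placeholder name from this post, and `resolve_all` and `pick_address` are hypothetical helper names. Resolvers usually rotate or shuffle the list themselves, so the random pick here only makes the spreading of clients explicit.

```python
import random
import socket


def resolve_all(host, port=80):
    """Return every distinct IPv4 address published for host."""
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:      # keep order, drop duplicates
            addresses.append(ip)
    return addresses


def pick_address(addresses, rng=random):
    """Choose one address at random, spreading clients across the hosts."""
    if not addresses:
        raise ValueError("no addresses to choose from")
    return rng.choice(addresses)


if __name__ == "__main__":
    # Hypothetical name; substitute a host that really has several A records.
    print(pick_address(resolve_all("db.example.com")))
```

Note that this only spreads clients evenly; it knows nothing about how busy each host is.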

--
Jem Berkes
http://www.sysdesign.ca/
 
 
Alan Connor
Guest
Posts: n/a

 
      12-15-2003, 11:59 PM
On 16 Dec 2003 00:22:17 GMT, Jem Berkes wrote:
>
>
>> Let's say that I have a database distributed among 20 boxes, all at
>> different locations with different IPs.
>>
>> The full database takes only 5 of these boxes, so there are 3 mirrors.
>>
>> Can someone tell me the best way to set things up so that a client can
>> just enter a URL and access whichever group of 5 boxes currently
>> isn't being used to capacity? How to connect those 5 boxes together
>> so that they are effectively ONE box?

>
> To answer your first question, using DNS you can assign multiple IP
> addresses to a single host name. For instance, if IPs A, B, C, D, E are
> address records for db.example.com, then accessing http://db.example.com
> will send traffic to one of those 5 IP addresses in a load-balanced
> fashion. Typically the IP you get is random (I think?)
>


DNS keeps track of the load on the individual IPs?

So each mirror would have one box whose IP was listed by DNS.

What if the listed box in a mirror went off line and was
replaced by another one? Can you change the DNS records quickly?

Can the mirror, regardless of the IPs composing it at the moment,
be assigned an IP?

> As for connecting the 5 independent hosts together so that they are
> effectively one, maybe rsync is the way to go to synchronize the
> filesystems.
>


rsync looks interesting. Would probably work.

> --
> Jem Berkes
> http://www.sysdesign.ca/


Perfect. And no wasted words, as usual. Thanks Jem.

AC

 
 
Jem Berkes
Guest
Posts: n/a

 
      12-16-2003, 12:33 AM
> DNS keeps track of the load on the individual IPs?

This is not inherent to DNS, so you would have to find a way to supply this
data. For instance: your DNS server could have a default list of IPs {a, b,
c} that are supplied in response to DNS queries. Let's say you have scripts
running on the individual boxes that send an "OK" message to your DNS
server every 15 minutes. The boxes might not send this "OK" if they are
heavily loaded, or crashed. Then the DNS server could remove that IP from
the list ensuring that this box isn't queried by clients.
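
The "OK" message scheme above can be sketched roughly as follows. This is only an illustration in Python, not how BIND or djbdns do it: `HeartbeatRegistry` and its methods are hypothetical names, and the 20-minute `max_age` is an assumed value chosen to be a little longer than the 15-minute reporting interval so that one late report does not drop a box.

```python
import time


class HeartbeatRegistry:
    """Track which boxes have recently reported "OK"."""

    def __init__(self, default_ips, max_age=20 * 60, clock=time.monotonic):
        self.max_age = max_age          # seconds a report stays valid
        self.clock = clock              # injectable so the logic can be tested
        now = clock()
        # Every box on the default list starts out as healthy.
        self.last_ok = {ip: now for ip in default_ips}

    def report_ok(self, ip):
        """Called when a box's script sends its "OK" message."""
        self.last_ok[ip] = self.clock()

    def live_ips(self):
        """IPs that should still be handed out in DNS answers."""
        cutoff = self.clock() - self.max_age
        return sorted(ip for ip, seen in self.last_ok.items()
                      if seen >= cutoff)
```

Whatever generates the DNS answers would call `live_ips()` and publish only those addresses; a box that is overloaded or down simply stops calling `report_ok` and ages out of the list.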

I don't know too much about configuring BIND or djbdns (the two major UNIX
DNS servers) but I know this scheme is used. Something like it is described
by this service, which I use for my own domains.
http://www.zoneedit.com/doc/faq.html#fo

> So each mirror would have one box whose IP was listed by DNS.


Right.

> What if the listed box in a mirror went off line and was
> replaced by another one? Can you change the DNS records quickly?


DNS records have a TTL (time to live) that can realistically be anywhere
from 10 minutes to several hours. You could change the IPs that quickly.
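
To make the effect of the TTL concrete, here is a small Python sketch of a caching client. `TtlCache` is a hypothetical name and `lookup` stands for any function that returns the current address list; the point is only that a cached answer keeps being used until its TTL runs out, so a changed record takes up to one TTL to reach everybody.

```python
import time


class TtlCache:
    """Cache lookup results for ttl seconds, like a caching resolver."""

    def __init__(self, lookup, ttl=600, clock=time.monotonic):
        self.lookup = lookup    # function: name -> list of IPs
        self.ttl = ttl          # 600 s = the 10-minute figure mentioned above
        self.clock = clock
        self._entries = {}      # name -> (expiry time, list of IPs)

    def resolve(self, name):
        now = self.clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[0]:
            return entry[1]                  # still fresh: reuse old answer
        ips = self.lookup(name)              # expired or missing: ask again
        self._entries[name] = (now + self.ttl, ips)
        return ips
```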

> Can the mirror, regardless of the IPs composing it at the moment,
> be assigned an IP?


I'm not entirely sure what you mean by this... but think of DNS as just
providing a lookup mechanism for host name -> one or more IP addresses.

--
Jem Berkes
http://www.sysdesign.ca/
 
 
Alan Connor
Guest
Posts: n/a

 
      12-16-2003, 01:28 AM
On 16 Dec 2003 01:33:02 GMT, Jem Berkes wrote:
>
>
>> DNS keeps track of the load on the individual IPs?

>
> This is not inherent to DNS, so you would have to find a way to supply this
> data. For instance: your DNS server could have a default list of IPs {a, b,
> c} that are supplied in response to DNS queries. Let's say you have scripts
> running on the individual boxes that send an "OK" message to your DNS
> server every 15 minutes. The boxes might not send this "OK" if they are
> heavily loaded, or crashed. Then the DNS server could remove that IP from
> the list ensuring that this box isn't queried by clients.
>
> I don't know too much about configuring BIND or djbdns (the two major UNIX
> DNS servers) but I know this scheme is used. Something like it is described
> by this service, which I use for my own domains.
> http://www.zoneedit.com/doc/faq.html#fo
>
>> So each mirror would have one box whose IP was listed by DNS.

>
> Right.
>
>> What if the listed box in a mirror went off line and was
>> replaced by another one? Can you change the DNS records quickly?

>
> DNS records have a TTL (time to live) that can realistically be anywhere
> from 10 minutes to several hours. You could change the IPs that quickly.
>
>> Can the mirror, regardless of the IPs composing it at the moment,
>> be assigned an IP?

>
> I'm not entirely sure what you mean by this... but think of DNS as just
> providing a lookup mechanism for host name -> one or more IP addresses.
>
> --
> Jem Berkes
> http://www.sysdesign.ca/


Okay. 10 minutes is quick enough. The DNS server could call a program
that kept track of how much work had been assigned to any particular
mirror.

So the client enters the search string which a simple program divides into
5 separate sub-strings, and this is sent to whatever box the DNS server says
is available, with wget. The boxes all run a stripped-down apache.

It takes one of the sub-strings and searches its segment of the db, sending the
other 4 sub-strings to the other boxes in its mirror which use their db
program to search the segment they have and send any results back to the
listed server via rsync, which concatenates them into a simple html document
and sends them directly to the user.

Sounding sensible?

AC


 
 
Alan Connor
Guest
Posts: n/a

 
      12-16-2003, 01:58 AM


Okay. 10 minutes is quick enough. The DNS server could call a program
that kept track of how much work had been assigned to any particular
mirror.

So the client enters the search string which a simple program divides into
5 separate sub-strings, and this is sent to whatever box the DNS server says
is available, with wget. The listed box is running a stripped-down apache.

It takes one of the sub-strings and searches its segment of the db, sending the
other 4 sub-strings to the other boxes in its mirror which use their db
program to search the segment they have and send any results back to the
listed server via rsync, which concatenates them into a simple html document
and sends them directly to the user.
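
A rough Python sketch of the plan above, under some loud assumptions: plain function calls stand in for the wget requests and the rsync transfers, each "box" is just a list of text entries held in memory, and every segment receives the full keyword list rather than one sub-string each. `search_segment` and `scatter_gather` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from html import escape


def search_segment(segment, keywords):
    """Return the entries in one segment that contain every keyword."""
    wanted = [k.lower() for k in keywords]
    return [entry for entry in segment
            if all(k in entry.lower() for k in wanted)]


def scatter_gather(segments, query):
    """Search all segments in parallel and merge the hits into HTML."""
    keywords = query.split()
    with ThreadPoolExecutor(max_workers=max(1, len(segments))) as pool:
        # map() preserves segment order, so the output is deterministic.
        per_segment = list(pool.map(lambda s: search_segment(s, keywords),
                                    segments))
    hits = [hit for part in per_segment for hit in part]
    items = "".join(f"<li>{escape(hit)}</li>" for hit in hits)
    return f"<html><body><ul>{items}</ul></body></html>"
```

The listed box would play the role of `scatter_gather`, and the other four boxes the role of `search_segment`; in a real setup each call would be a network request with a timeout.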

Sounding sensible?

AC


 
 
Neil Horman
Guest
Posts: n/a

 
      12-16-2003, 12:14 PM
Alan Connor wrote:
> Let's say that I have a database distributed among 20 boxes, all at different
> locations with different IPs.
>
> The full database takes only 5 of these boxes, so there are 3 mirrors.
>
> Can someone tell me the best way to set things up so that a client can
> just enter a URL and access whichever group of 5 boxes currently isn't
> being used to capacity? How to connect those 5 boxes together so that
> they are effectively ONE box?
>

DNS is a cheap solution. Several name server packages provide features
(or at least have patches available) to allow you to specify some level
of load balancing policy, be it a simple round robin approach, or
something a little more complex. It takes a little work to set up, and
it can get a little askew, as once a DNS lookup is cached by a client
there is no rebalancing available to the system.

Alternatively there are professional hardware/software solutions to do
this. IBM offers WebSphere (I think) which enables this sort of
transparent load balancing feature, and it appears as one IP address to
the outside world, so multiple accesses from the same client can be
rebalanced as needed. Expect to pay corporate prices for this though.

HTH
Neil

--
Neil Horman
Red Hat, Inc., http://people.redhat.com/nhorman
gpg keyid: 1024D / 0x92A74FA1, http://www.keyserver.net

 
 
Alan Connor
Guest
Posts: n/a

 
      12-16-2003, 08:58 PM
On Tue, 16 Dec 2003 08:14:28 -0500, Neil Horman wrote:
>
>
> Alan Connor wrote:
>> Let's say that I have a database distributed among 20 boxes, all at different
>> locations with different IPs.
>>
>> The full database takes only 5 of these boxes, so there are 3 mirrors.
>>
>> Can someone tell me the best way to set things up so that a client can
>> just enter a URL and access whichever group of 5 boxes currently isn't
>> being used to capacity? How to connect those 5 boxes together so that
>> they are effectively ONE box?
>>

> DNS is a cheap solution. Several name server packages provide features
> (or at least have patches available) to allow you to specify some level
> of load balancing policy, be it a simple round robin approach, or
> something a little more complex. It takes a little work to set up, and
> it can get a little askew, as once a DNS lookup is cached by a client
> there is no rebalancing available to the system.
>
> Alternatively there are professional hardware/software solutions to do
> this. IBM offers WebSphere (I think) which enables this sort of
> transparent load balancing feature, and it appears as one IP address to
> the outside world, so multiple accesses from the same client can be
> rebalanced as needed. Expect to pay corporate prices for this though.
>
> HTH
> Neil
>


Indeed it does. Thanks Neil.


The simplest solution at this point seems to be to have a separate application
keep track of the work sent to a particular mirror, and to temporarily
remove that IP from the available list when it has all that it can handle.
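
Something like the following is what I have in mind for that application. It is only a sketch in Python: `MirrorTracker` is a hypothetical name, the capacities are made-up numbers, and how the result gets fed to the DNS server is left open.

```python
class MirrorTracker:
    """Count outstanding requests per mirror and hide the full ones."""

    def __init__(self, capacities):
        self.capacity = dict(capacities)             # ip -> max concurrent jobs
        self.active = {ip: 0 for ip in self.capacity}

    def available(self):
        """IPs that may still be handed out, least loaded first."""
        free = [ip for ip in self.capacity
                if self.active[ip] < self.capacity[ip]]
        return sorted(free, key=lambda ip: (self.active[ip], ip))

    def assign(self):
        """Pick the least-loaded mirror and record one more job on it."""
        free = self.available()
        if not free:
            raise RuntimeError("every mirror is at capacity")
        ip = free[0]
        self.active[ip] += 1
        return ip

    def finish(self, ip):
        """Record that a job on ip has completed."""
        if self.active[ip] > 0:
            self.active[ip] -= 1
```

A mirror drops out of `available()` the moment it reaches its limit and comes back as soon as `finish` is called for one of its jobs.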


AC

 
 
P Gentry
Guest
Posts: n/a

 
      12-17-2003, 03:02 AM
Alan Connor wrote:
> On Tue, 16 Dec 2003 08:14:28 -0500, Neil Horman wrote:
> >
> >
> > Alan Connor wrote:
> >> Let's say that I have a database distributed among 20 boxes, all at different
> >> locations with different IPs.
> >>
> >> The full database takes only 5 of these boxes, so there are 3 mirrors.
> >>
> >> Can someone tell me the best way to set things up so that a client can
> >> just enter a URL and access whichever group of 5 boxes currently isn't
> >> being used to capacity? How to connect those 5 boxes together so that
> >> they are effectively ONE box?
> >>

> > DNS is a cheap solution. Several name server packages provide features
> > (or at least have patches available) to allow you to specify some level
> > of load balancing policy, be it a simple round robin approach, or
> > something a little more complex. It takes a little work to set up, and
> > it can get a little askew, as once a DNS lookup is cached by a client
> > there is no rebalancing available to the system.
> >
> > Alternatively there are professional hardware/software solutions to do
> > this. IBM offers WebSphere (I think) which enables this sort of
> > transparent load balancing feature, and it appears as one IP address to
> > the outside world, so multiple accesses from the same client can be
> > rebalanced as needed. Expect to pay corporate prices for this though.
> >
> > HTH
> > Neil
> >

>
> Indeed it does. Thanks Neil.
>
>
> The simplest solution at this point seems to be to have a separate application
> keep track of the work sent to a particular mirror, and to temporarily
> remove that IP from the available list when it has all that it can handle.
>
>
> AC


Forgive me, but I am easily confused. Are you saying that it takes 5
separate boxes to _hold_ all the db data? Located in different
locations (thus different IP's, different nets). Does each site have
5 boxes dedicated to the db? And what's with the mirrors? Are these
replicated db's or do they need to be up to date within say 10
minutes? I'm just trying to get the spec right. Like I said, I'm
easily confused.

Depending on the nature of your data and db, the load you expect, the
nature of the traffic on the different nets and how much work you're
willing to do yourself or pay for, you may want to look at ZEO. Did
not look through old mail, but I seem to recall some one or more
people using it in a fashion somewhat like what I _think_ you're
seeking. They may have used RSS to push ZEO updates out to different
sites. I recall talk of schemes trying to achieve some load
balancing, but to tell the truth I'm not clear what they or what you
really want to achieve in this regard.

Anyway, it may be worth a look:
http://zope.org/Products/ZEO/ZEOFactSheet
http://mail.zope.org/pipermail/zodb-dev/ (mailing list archive
with some ZEO)

Note that to use ZEO you do NOT have to use Zope. Technically, you
don't _have_ to use the ZODB, but most people do because of its
design, integration with ZEO, and "ease" with which it can interface
with RDBMSs if you need that kind of data storage. And if you are
willing to work with Python, there's probably a complete solution
(with hand built glue code and gap fillers) . It might give you a
piece of the puzzle, anyway.

hth,
prg
email above disabled
 
 
Alan Connor
Guest
Posts: n/a

 
      12-17-2003, 04:39 AM
On 16 Dec 2003 20:02:26 -0800, P Gentry wrote:
>>


<snip>

>> AC

>
> Forgive me, but I am easily confused. Are you saying that it takes 5
> separate boxes to _hold_ all the db data? Located in different
> locations (thus different IP's, different nets). Does each site have
> 5 boxes dedicated to the db? And what's with the mirrors? Are these
> replicated db's or do they need to be up to date within say 10
> minutes? I'm just trying to get the spec right. Like I said, I'm
> easily confused.
>


All the boxes are spread out across several states. It takes 5 to hold
the entire database, so each has only 1/5 of it.

Each site has one box dedicated to the db.

The 'mirrors', all of them, must be kept up to date at all times, at least
the ones that are online.

The boxes comprising any particular 'mirror' can and will change. It will
be whichever ones are available.

> Depending on the nature of your data and db, the load you expect, the
> nature of the traffic on the different nets and how much work you're
> willing to do yourself or pay for, you may want to look at ZEO. Did
> not look through old mail, but I seem to recall some one or more
> people using it in a fashion somewhat like what I _think_ you're
> seeking. They may have used RSS to push ZEO updates out to different
> sites. I recall talk of schemes trying to achieve some load
> balancing, but to tell the truth I'm not clear what they or what you
> really want to achieve in this regard.
>


I'll have a look at those apps.

The database will contain a directory of websites that will return a
description of the site in plain text, and technical specs like size
and whether it uses frames and whether cookies are required, and so
forth.

Like a search engine, but with more usable 'hits'.


A card catalog, really.

> Anyway, it may be worth a look:
> http://zope.org/Products/ZEO/ZEOFactSheet
> http://mail.zope.org/pipermail/zodb-dev/ (mailing list archive
> with some ZEO)
>
> Note that to use ZEO you do NOT have to use Zope. Technically, you
> don't _have_ to use the ZODB, but most people do because of its
> design, integration with ZEO, and "ease" with which it can interface
> with RDBMSs if you need that kind of data storage. And if you are
> willing to work with Python, there's probably a complete solution
> (with hand built glue code and gap fillers) . It might give you a
> piece of the puzzle, anyway.
>


Great! Thanks a bunch. I'm working on a front end for wget that will
submit the search strings to the listed (DNS) member of any of the
mirrors, where a webserver will be running. An app there will direct
the search programs on all of the boxes, or those that need to be
searched.

There'll be at least one DNS server up which will be aware of the load
on any particular 'mirror' and route accordingly.

The listed box, which has the webserver, will send the parts of the search
string to the relevant segment-boxes which will return any hits via rsync
and they'll be concatenated and sent to the enduser to open in their browser.

With everything compressed and not much data actually being involved, it
should be pretty quick.

Lots of other details of course, but that's the current outline of
a plan that could change tomorrow.

> hth,
> prg
> email above disabled



On my way to ZEO....


AC

 