LVS-HOWTO: Ipvsadm

7.1 Using ipvsadm

You use ipvsadm from the command line (or in rc files) to setup: -

services/servers that the director directs (eg http goes to all real-servers, while ftp goes only to real-server_1).
weighting given to each real-server - useful if some servers are faster than others.
scheduling algorithm
- round robin (rr), weighted round robin (wrr) - new connections are assigned to each real-server in turn
- least connected (lc), weighted least connection (wlc) - new connections go to real-server with the least number of connections (this is not neccessarily the least busy machine but is a step in that direction). Note: Doug Bagley doug@deja.com points out that *lc schedulers will not work properly if a particular real-server is used in two different LVSs.
- newer scheduling methods designed for webcaches (DH, destination hash from code developed by Thomas Proell proellt@gmx.de) and multiple firewalls (SH, source hash by Henrik Nordstrom hno@safecore.se).

any of these will do for a test setup (rr will cycle connections to each real-server in turn, allowing you to check that all real-servers are functioning in the LVS).

You use ipvsadm to

add services: add a service with weight >0
shutdown services: set the weight to 0. This allows current connections to continue, untill they disconnect or expire, but will not allow new connections. When there are no connectionsi remaining, you can reconfigure your service or turn off the server.
delete services: this stops traffic for the service, but the entry in the connection table is not deleted till it times out. This allows deletion followed shortly thereafter by adding back the service, to not affect established (but quiescent) connections.
save the ipsvadm settings of your lvs, with ipvsadm-save
restore the ipvsadm settings or your lvs, with ipvsadm-restore

7.2 Compile ipvsadm for each new ipvs

Compile and install ipvsadm on the director using the supplied Makefile. You can optionally compile ipvsadm with popt libraries, which allows ipvsadm to handle more complicated arguments on the command line. The default compile uses static popt libraries. If your libpopt.a is too old, your ipvsadm will segv. (I'm compiling with a newer dynamic libpopt).

Since you compile ipvs and ipvsadm independantly and you cannot compile ipvsadm until you have patched the kernel headers, a common mistake is to compile the kernel and reboot, forgetting to compile/install ipvsadm.

Unfortunately there is only rudimentary version detection code into ipvs/ipvsadm. If you have a mismatched ipvs/ipvsadm pair, many times you won't notice any problems, as any particular version of ipvsadm will work with a wide range of patched kernels. Usually with 2.2.x kernels, if the ipvs/ipvsadm versions mismatch, you'll get wierd but non-obvious errors about not being able to install your LVS. Other possibilities are that the output of ipvsadm -L will have IP's that are clearly not IPs (or not the IP's you put in) and ports that are all wrong. There was a change in the /proc file system for ipvs about 2.2.14 which caused problems for anyone with a mismatched ipvsadm/ipvs. The ipvsadm from different kernel series (2.2/2.4) do not recognise the ipvs kernel patches from the other series (they appear to not be patched for ipvs).

The later 2.2.x ipvsadms know the minimum version of ipvs that they'll run on, and will complain about a mismatch. They don't know the maximum version (produced presumably some time in the future) /outsithat they will run on. This protects you against the unlikely event of installing a new 2.2.x version of ipvsadm on an older version of ipvs, but will not protect you against the more likely scenerio where you forget to compile ipvsadm after building your kernel. The ipvsadm maintainers are aware of the problem and aren't going to fix it sometime.

If you didn't even apply the kernel patches for ipvs, then ipvsadm will complain about missing modules.

"Ty Beede" tybeede@metrolist.net writes:

 > on a slackware 4.0 machine I went to compile ipvsadm and it gave
 > me an error indicating that the iphdr type was undefined and
 > it didn't like that when it saw the ip_fw.h header file.  I just
 > #included <linux/ip.h> in ipvsadm.c, which is where the iphdr
 > #structure is defined and everything went ok

From: Doug Bagley doug@deja.com

   The reason that it fails "out of the box" is because fwp_iph's
   type definition (struct iphdr) was #ifdef'd out in <linux/ip_fw.h>
   (and not included anywhere else) since the symbol __KERNEL_ was
   undefined. Including <linux/ip.h> before <linux/ip_fw.h> in the .c
   file did the trick.

7.3 schedulers

On receiving a connect request from a client, the director assigns a real-server to the client based on a "schedule". The scheduler type is set with ipvsadm. The schedulers available are

rr, wrr: round robin, weighted round robin
lc, wlc: least connection, weighted least connection (the director has a table with the number of connections for each real-server).
persistent connection
LBLC: a persistent memory algorythm
dh: destination hash
sh: source hash

The rr,wrr,lc,wlc schedulers should all work similarly for identical real-servers with identical services. The lc scheduler will better handle situations where machines are brought down and up again (see thundering herd problem). If the real-servers are offering different services and some have clients connected for a long time while others are connected for a short time, or some are compute bound, while others are network bound, then none of the schedulers will do a good job of distributing the load between the real-servers. LVS doesn't have any load monitoring of the real-servers. Figuring out a way of doing this that will work for a range of different types of services isn't simple (see the mailing list archives for endless discussion on remote load monitoring).

The LBLC code (from Julian) and the dh scheduler (from Thomas Proell) are designed for web caching real-servers (e.g. squids). For normal LVS services (eg ftp, http), the content offered by each real-server is the same and it doesn't matter which real-server the client is connected to. For a web cache, after the first fetch has been made, the web caches have different content. As more pages are fetched, the contents of the web caches will diverge. Since the web caches will be setup as peers, they can communicate by ICP (internet caching protocol) and find the cache(s) with the required page. This is faster than fetching the page from the original webserver. However, it would be better after the first fetch of a page from http://www.foo.com/* , for all subsequent clients wanting a page from http://www.foo.com/ to be connected to that real-server.

The original method for handling this was to make connections to the real-servers persistent, so that all fetches from a client went to the same real-server.

The -dh algorythm makes a hash from the target IP and all requests to that IP will be sent to the same real-server. This means that content from a URL will not be retrieved multiple times from the remote server. The real-servers (eg squids in this case) will each be retreiving content from different URLs.

The -sh (source hash) scheduler is for directors with multiple firewalls. It's from Henrik Nordstrom, who is involved with developing web caches. The director hashes on the MAC address of the firewall.

Henrik Nordstrom 14 Feb 2001

Here is a small patch to make LVS keep the MARK, and have return traffic inherit the mark.
We use this for routing purposes on a multihomed LVS server, to have return traffic routed back the same way as from where it was received. What we do is that we set the mark in the iptables mangle chain depending on source interface, and in the routing table use this mark to have return traffic routed back in the same (opposite) direction.
The patch also moves the priority of LVS INPUT hook back to infront of iptables filter hook, this to be able to filter the traffic not picked up by LVS but matchin it's service definitions. We are not (yet) interested of filtering traffic to the virtual servers, but very interested in filtering what traffic reaches the Linux LVS-box itself.
(Julian) - who uses NFC_ALTERED ?
Netfilter. The packet is accepted by the hook but altered (mark changed).
(Julian) Give us an example (with dummy addresses) for setup that require such fwmark assignments.
For a start you need a LVS setup with more than one real interface receiving client traffic for this to be of any use. Some clients (due to routing outside the LVS server) comes in on one interface, other clients on another interface. In this setup you might not want to have a equally complex routing table on the actual LVS server itself.

Regarding iptables / ipvs I currently "only" have three main issues.

As the "INPUT" traffic bypasses most normal routes, the iptables conntrack will get quite confused by return traffic..

Sessions will be tracked twice. Both by iptables conntrack and by IPVS.

There is no obvious choice if IPVS LOCAL_IN sould be placed before or after iptables filter hook. Having it after enables the use of many fancy iptables options, but instead requires one to have rules in iptables for allowing ipvs traffic, and any mismatches (either in rulesets or IPVS operation) will cause the packets to actually hit the IP interface of the LVS server which in most cases is not what was intended.

From: Wensong Zhang wensong@gnuchina.org 16 Feb 2001

Please see "man ipvsadm" for short description of DH and SH schedulers. I think some examples to use those two schedulers.

Example1: cache cluster shared by several load balancers.
                Internet
                |
                |------cache array
                |
                |-----------------------------
                   |                |
                   DH               DH
                   |                |
                 Access            Access
                 Network1          Network2
The DH scheduler can keep the two load balancer redirect requests destined for the same IP address to the same cache server. If the server is dead or overloaded, the load balancer can use cache_bypass feature to send requests to the original server directly. (Make sure that the cache servers are added in the two load balancers in the same order)
Note that the DH development is inspired by the consistent hashing scheduler patch from Thomas Proell proellt@gmx.de

Example2: Firewall Load Balancing
                      |-- FW1 --|
  Internet ----- SH --|         |-- DH -- Protected Network
                      |-- FW2 --|
Make sure that the firewall boxes are added in the load balancers in the same order. Then, request packets of a session are sent to a firewall, e.g. FW1, the DH can forward the response packets from protected network to the FW1 too. However, I don't have enough hardware to test this setup myself. Please let me know if any of you make it work for you. :)

For initial discussions on the -dh and -sh scheduler see on the mailing list under "some info for DH and SH schedulers" and "LVS with mark tracking".

7.4 does rr equally distribute the load?

I ran the polygraph simple.pg test on a VS-NAT LVS with 4 realservers using rr scheduling. Since the responses from the real-servers should average out I would have expected the number of connection and load average on the real-servers to be equally distributed over the real-servers.

Here's the output of ipvsadm shortly after the number of connections had reached steady state (about 5 mins).

IP Virtual Server version 0.2.12 (size=16384)                  
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:polygraph rr
  -> doc.mack.net:polygraph         Masq    1      0          883       
  -> dopey.mack.net:polygraph       Masq    1      0          924       
  -> bashfull.mack.net:polygraph    Masq    1      0          1186      
  -> sneezy.mack.net:polygraph      Masq    1      0          982

The number of connections (all in TIME_WAIT) at the real-servers was different for each (otherwise apparently identical) real-server and was in the range 450-1000 (measured with netstat -an | grep $polygraph_port |wc ) and varied about 10% over a long period.

Repeating the run using "lc" scheduling, the InActConn remains constant.

IP Virtual Server version 0.2.12 (size=16384)                  
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:polygraph lc
  -> doc.mack.net:polygraph         Masq    1      0          994       
  -> dopey.mack.net:polygraph       Masq    1      0          994       
  -> bashfull.mack.net:polygraph    Masq    1      0          994       
  -> sneezy.mack.net:polygraph      Masq    1      0          993

Still the number of connections (all in TIME_WAIT) at the real-servers did not change (ranged from 450-1000).

Joe, 14 May 2001

according to the ipvsadm man page, for "lc" scheduling, the new connections are assigned according to the number of "active connections". Is this the same as "ActConn" in the output of ipvsadm?

If the number of "active connections" used to determine the scheduling is "ActConn", then for services which don't maintain connections, the scheduler won't have much information, just "0" for all real-servers?

Julian

The formula is: ActConn * K + InActConn
where K can be 32 to 50, I don't remember the last used value.
So, it is not only the active conns, this will break UDP.

I've been running the polygraph simple.pg test over the weekend using rr scheduling on what (AFAIK) are 4 identical real-servers in a VS-NAT LVS. There are no ActConn and a large number of InActConn. Presumably the client makes a new connection for each request.

The implicit persistence of TCP connection reuse can cause such side effects even for RR. When the setup includes small number of hosts and the used rate is big enough to reuse the client's port, the LVS detects existing connections and new connections are not created. This is the reason you can see some of the rs not to be used at all, even for such method as RR.

the client is using ports from 1025-4999 (has about 2000 open at one time) and it's not going above the 4999 barrier. ipvsadm shows a constant InActConn of 990-995 for all realservers, but the number of connections on each of the real-servers (netstat -an) ranges from 400-900.

So if the client is reusing ports (I thought you always incremented the port by 1 till you got to 64k and then it rolled over again), LVS won't create a new entry in the hash table if the old one hasn't expired?

Yes, it seems you have (5000-1024) connections that never expire in LVS.

Presumably because the director doesn't know the number of connections at the real-servers (it only has the number of entries in its tables), and because even apparently identical real-servers aren't identical (the hardware here is the same, but I set them up at different times, presumably not all the files and time outs are the same), the throughput of different real-servers may not be the same.

7.5 persistent connections

Unfortunately the term "persistence" has 2 meanings in setting up an LVS. There is "persistent connection" a term used for connecting to webservers and databases and "persistent connection" used in LVS. These are quite different.

You need LVS persistence if you need the client to always connect to the same real-server. You'll need this if

you need session information to stay on one machine, eg SSL keys for https
you are using a multiport protocol. Examples are
- ftp using ports 20,21. (For VS-NAT, this is handled by the ip_masq_ftp module). You need persistence for VS-DR and VS-Tun
- e-commerce sites where a customer on port 80 has to transfer to port 443 on the same machine to pay for an order.

From: bobby.moore@worldspan.com wrote:

> What does the term 'persistence' mean for IPVS?

Persistant connection outside of LVS is described in http persistent connection and is an application level protocol.

In normal http (or database connection), after the server has sent it's reply, it shuts down the tcpip connection. This makes your session with the server stateless - it has no idea what you've previously done on the machine. If the exchanges are small (eg 1 packet), then as well you've gone through a lot of handshakes to exchange one packet as your payload. At connect time the client and server notify each other that they support persistent connection and the server uses an algorithm to determine when to drop the connection (timeout, needs handle...). The client can drop the connection anytime they like.

The persistant (or sticky) connection of LVS is a layer 4 protocol which makes connections from client(s) on an IP (or network) go to the same real-server. This is not the same as the persistent connection described above and could alternately be described as connection affinity or port affinity.

In an LVS if you are accumulting state (eg a shopping cart), then you want the client to connect to the same server on port 443 as was used for the port 80 connection. If you are doing passive ftp, then you want the second connection to come back to the same real-server. In these cases you want the connection affinity provided by the LVS persistent connection option.

From: Wensong Zhang wensong@gnuchina.org 11 Jan 2001

The working principle of persistence in LVS is as follows:

a persistent template is used to keep the persistence between the client and the server.
when the first connection from a client, the LVS box will select a server according to the scheduling algoriths, then create a persistent template and the connection entry. the control of the connection entry is the template.
The late connections from the clients will be forwarded to the same server, as long as the template doesn't expire. The control of their connection entries are the template.
If the template has its controlled connections, it won't expire.
If the template has no controlled connections, it expires in its own time.

You can trace your system in the following way. For example:


[root@kangaroo /root]# ipvsadm -ln
IP Virtual Server version 1.0.3 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  172.26.20.118:80 wlc persistent 360
  -> 172.26.20.91:80             Route   1      0          0
  -> 172.26.20.90:80             Route   1      0          0
TCP  172.26.20.118:23 wlc persistent 360
  -> 172.26.20.90:23             Route   1      0          0
  -> 172.26.20.91:23             Route   1      0          0

[root@kangaroo /root]# ipchains -L -M -n
IP masquerading entries
prot expire   source               destination          ports
TCP  02:46.79 172.26.20.90         172.26.20.222        23 (23) -> 0

Although there is no connection, the template isn't expired. So, new connections from the client 172.26.20.222 will be forwarded to the server 172.26.20.90.

persistent client connection, pcc (for kernel =<2.2.10)

All connections from an IP go to same real-server. Timeout for inactive connections is 360 sec. pcc is designed for https and cookie serving. With ppc, after the first connection (say to port 80) any subsequent connections requests from the same client but from another port (eg 443) will be sent to the same real-server. The problem with this is that about 25% of the people on the internet have the same IP (AOL customers are connected to the internet via a server in Virginnia, USA). If you have pcc set, then after the first client connects from AOL, then all subsequent connections from AOL will go to the same real-server, until the last AOL client disconnects. This effect will override attempts to distribute the load between real-servers.

persistent port connection (ppc) (for kernel >= 2.2.12)

With kernel 2.2.12, the persistent connection feature has been changed from a scheduling algorythm (you get rr|wrr|lc|wlc|pcc) to a switch (you can have persistent connection with rr|wrr|lc|wlc). If you do not select a scheduling algorithm when asking for a persistent connection, ipvsadm will default to wlc.

The difference between pcc and ppc is probably of minor consequence to the LVS admin (if you want persistent connection, you have to have it and you don't care how you got it). With ppc, connections are assigned on a port by port basis. With ppc, if both port 80 and 443 were persistant, then connections from the same client would not neccessarily go to the same real-server. This solves the AOL problem.

If you are handing out cookies to a client on port 80 and they need to go to port 443 to give their credit card, you want them going to the same real-server. There is no way to make ports sticky by groups (or pairs), so for the moment you emulate the pcc connection by using port 0.

Because of the way proxies can work, a client can come from one IP for one connection (eg port 80) and from another IP for the next connection (eg port 443). For this you can make a netmask of IPs sticky.

(valery brasseur) > I have seen some discussion about "proxy farm" such as AOL or T-Online,

(Wensong) If you want to build a persistent proxy cluster, you just need set a LVS box at the front of all proxy servers, and use the persistent port option in the ipvsadm commands. BTW, you can have a look at http://wwwcache.ja.net/JanetServices/PilotServices.html for how to build a big JANET cache cluster using LVS.

If you want to build a persistent web service but some proxy farms are non-persistent at client side, then you can use the persistent granularity so that clients can be grouped, for example you use 255.255.255.0 mask, the clients from the same /24 network will go to the same server.

While persistence can be used for services that require multiple ports eg ftp/ftp-data, http/https it can be useful for ssl services.

Here's an example of using persistence granularity (from Ratz 3 Jan 2001). The -M 255.255.255.255 sets up /32 granularity. Here port 80 and port 443 are being linked by fwmarks.

ipchains -A input -j ACCEPT -p tcp -d 192.168.1.100/32 80 -m 1 -l
ipchains -A input -j ACCEPT -p tcp -d 192.168.1.100/32 443 -m 1 -l
ipvsadm -A -f 1 -s wlc -p 333 -M 255.255.255.255
ipvsadm -a -f 1 -r 192.168.1.1 -g -w 1
ipvsadm -a -f 1 -r 192.168.1.2 -g -w 1

For more information on persistence granularity see the section on persistence granularity with fwmark. It's use for fwmark is the same as for VIP.

Francis Corouge wrote:

I made a VS-DR lvs. All services work well, but with IE 4.1 on secured connection, pages are received randomly. when you make several requests, sometime the page is displayed, but sometimes a popup error message is displayed

        Internet Explorer can't open your Internet Site <url>
        An error occured with the secured connexion.

I did not test with other versions of IE, but netscape works fine. It works when I connect directly to the real server (real-server disconnected from the LVS, and the VIP on the real-server allowed to arp).

Julian: Is the https service created persistent? ipvsadm -p

Why does persistence fix this problem? (also see http://www.linuxvirtualserver.org/persistence.html)

I assume the problem is in the way SSL is working: cached keys, etc. Without persistence configured, the SSL connections break when they hit another real server.

> what is (or might be) different about IE4 and Netscape?

Maybe in the way the bugs are encoded. But I'm not sure how the SSL requests are performed. It depends on that too.

Example 1. https only

This is done with persistent connection.

lvs_dr.conf config file excerpt

SERVICE=t https ppc 192.168.1.1

output from ipvsadm


ipvsadm settings
IP Virtual Server version 0.9.4 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  ssl.mack.net:https wlc persistent 360
  -> di.mack.net:https           Route   1      0          0

Example 2. All ports sticky, timeout 30mins, wrr scheduling

lvs_dr.conf config file excerpt

SERVICE=t 0 wrr ppc -t 1800 192.168.1.1

which specifies tcp (t), service (all ports = 0), weighted round robin scheduling (wrr), timeout 1800 secs (-t 1800), to realserser 192.168.1.1.

Here's the code generated by configure

#ppc persistent connection, timeout 1800 sec
/sbin/ipvsadm -A -t 192.168.1.110:0 -s wrr -p 1800
echo "adding service 0 to real-server 192.168.1.1 using connection type dr weight 1"
/sbin/ipvsadm -a -t 192.168.1.110:0 -R 192.168.1.1 -g -w 1

here's the output of ipvsadm

# ipvsadm
IP Virtual Server version 0.9.4 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  ssl.mack.net:https wrr persistent 1800
  -> di.mack.net:https           Route   1      0          0

Problems: Removing persistant connections after a real-server crash

From: Patrick Kormann pkormann@datacomm.ch

I have the following problem: I have a direct routed 'cluster' of 4 proxies. My problem is that even if the proxy is taken out of the list of real servers, the persistent connection is still active, that means, that proxy is still used.

 
 Andres Reiner wrote:
 (...snip...)
 >> Now I found some strange behaviour using 'mon' for the
 >> high-availability. If a server goes down it is correctly removed from
 >> the routing table. BUT if a client did a request prior to the server's
 >> failure, it will still be directed to the failed server afterwards. I
 >> guess this got something to do with the persistent connection setting
 >> (which is used for the cold fusion applications/session variables).
 >>
 >> In my understanding the LVS should, if a routing entry is deleted, no
 >> longer direct clients to the failed server even if the persistent
 >> connection setting is used.
 >>
 >> Is there some option I missed or is it a bug ?

Wensong Zhang wrote:

 > No, you didn't miss anything and it is not a bug either. :)
 >
 > In the current design of LVS, the connection won't be drastically
 > removed but silently drop the packet once the destination of the
 > connection is down, because monitering software may marks the server
 > temporary down when the server is too busy or the monitering software
 > makes some errors. When the server is up, then the connection continues.
 > If server is not up for a while, then the client will timeout. One thing
 > is gauranteed that no new connections will be assigned to a server when
 > it is down. When the client reestablishs the connection (e.g. press
 > reload/refresh in the browser), a new server will be assigned.
 
jacob.rief@tis.at// wrote:

 >
 > Unfortunately I have the same problem as Andres (see below)
 > If I remove a real server from a list of persistent
 > virtual servers, this connection never times out. Not even
 > after the specified timeout has been reached. Only if I unset

(Wensong) The persistent template won't timeout until all its connections timeout. After all the connections from the same client connection expires, new connections can be assigned to one of the remaining servers. You can use "ipchains -M -L -n" (or netstat -M) to check the connection table.

 
 > persisency the connection will be redirected onto the remaining
 > real servers. Now if I turn on persistency again, a prevoiusly
 > attached client does not reconnect anymore - it seems as
 > if LVS remembers such clients. It does not even help, if I delete
 > the whole virtual service and restore it immediately, in the
 > hope to clear the persistency tables.
 > (ipvsadm -D -t <VIP>; ipvsadm -A -t <VIP> -p; ipvsadm -a -t <VIP> -R <alive
 >  real server>)
 > And it also does not help closing the browser and restarting it.
 > I run LVS in masquerading mode on a 2.2.13-kernel patched
 > with ipvs-0.9.5.
 > Would'nt it be a nice feature to flush the persistent client
 > connection table, and/or list all such connections?

(Wensong) There are several reasons that I didn't do it in the current code. One is that it is time-consuming to search a big table (maybe one million entries) to flush the connections destined for the dead server; the other is that the template won't expire until its connection expire, the client will be assigned to the same server as long as there is a connection not expired. Anyway, I will think about better way to solve this problem.

 (valery brasseur)
 > I would like to to load balancing base on cookie and/or URL,

 (Wensong)
 Have a look at http://www.LinuxVirtualServer.org/persistence.html :-)

(also see <ref id="cookie" name="cookie">)

 Jean-Francois Nadeau wrote:
 > I will use LVS to load balance web servers (Direct Routing and WRR algo).
 > I use persitency with a big timeout (10 minutes).
 > Many of our clients are behind big proxies and I fear this will
 > unbalance our cluster because of the persitent timeout.

 (Wensong)
 persistent virtual services may lead to the load imbalance among
 servers. Using some weight adapation approaches may help avoid that some
 servers are overloaded for a long time. When the server is overloaded,
 decrease its weight so that connections from new clients won't be sent
 to that server. When the server is underloaded, increase its weight.

 > Can we alter directly /proc/net/ip_masquerade ?

 No, it is not feasible, because directly modifying masq entries will
 break the established connection.

Persistent and regular services are possible on the same real-server.

If you setup a 2 real-server VS-DR LVS with persistence,

ipvsadm -A -t $VIP -p -s -rr
ipvsadm -a -t $VIP -R $realserver1 $VS_DR -w 1
ipvsadm -a -t $VIP -R $realserver2 $VS_DR -w 1

giving the ipvsadm output

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.5 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:0 rr persistent 360
  -> bashfull.mack.net:0         Route   1      0          0         
  -> sneezy.mack.net:0           Route   1      0          0

then (as expected) a client can connect to any service on the real-servers (always getting the same real-server).

If you now add an entry for telnet to both real-servers, (you can run these next instructions before or after the 3 lines immediately above)

ipvsadm -A -t $VIP:telnet -s -rr
ipvsadm -a -t $VIP:telnet -R $realserver1 $VS_DR -w 1
ipvsadm -a -t $VIP:telnet -R $realserver2 $VS_DR -w 1

giving the ipvsadm output

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.5 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:0 rr persistent 360
  -> bashfull.mack.net:0         Route   1      0          0         
  -> sneezy.mack.net:0           Route   1      0          0         
TCP  lvs2.mack.net:telnet rr
  -> sneezy.mack.net:telnet      Route   1      0          0

the client will telnet to both real-servers in turn as would be expected for an LVS serving only telnet, but all other services (ie !telnet) go to the same first real-server. All services but telnet are persistent.

The director will make persistent all ports except those that are explicitely set as non-persistent. These two sets of ipvsadm commands do not overwrite each other. Persistent and non-persistent connections can be made at the same time.

(from Julian) This is part of the LVS design. The templates used for persistence are not inspected when scheduling packets for non-persistent connections.

Examples of persistence

ftp (VS-NAT): connections to both ftp ports are handled by the module ip_masq_ftp. You don't need to add persistence for ftp with VS-NAT.
ftp (VS-DR or VS-Tun): you need persistence on the realservers. Run the first set of commands above.
ftp and http (VS-NAT): persistence not needed (ip_masq_ftp handled the two ftp ports).
ftp and http (VS-DR or VS-Tun): persistence needed to handle the two port protocol ftp. You can set this up by running the first set of commands above. A secondary consequence of this arrangement is that a client connecting to the http service of the LVS will always get the same real-server (this may not be a great problem). If you want to make the http service non-persistent but leaving all other services persistent, then run the second set of commands (above) as well.
http and https (all forwarding methods): Normally an https connection is made after the client has made selections on an http connection when data is stored on the real-server for the client. In this case the real-server should be made persistent (first set of commands above). Do not run ipvsadm commands for http (like the second set of commands above) as this will make the http connections non-persistent.

Note: making real-server connections persistent allows _all_ ports to be forwarded by the LVS to the real-servers. Non-persistent LVS connections are only for the nominated service. An open, persistently connected real-server is a security hazard. You should run ipchains commands on the director to block all services on the VIP except those you want forwarded to the real-servers.

Multiple proxies

From: Jeremy Johnson jjohnson@real.com

> > how does LVS handles a single client that uses
> > multiple proxies... for instance aol, when an aol user attempts to connect
> > to a website, each request can come from a different proxy so, how/if does
> > LVS know that the request is from the same client and bind them to the same
> > server?

(Joe)

> if this is what aol does then each request will be independant and will
> not neccessarily go to the same real-server. Previous discussions about aol
> have assumed that everyone from aol was coming out of the same IP.
> Currently this is handled by making the connection persistant and all
> connections from aol will go to one real-server.

(Michael Sparks zathras@epsilon3.mcc.ac.uk) If ISP user (eg AOL) has a proxy array/farm then the requests are _likely_ to come from two possibilities:

* A single subnet (if using an L4/L7 switch that rewrites ether frames, or using several NAT based L4/L7 switches) or

* A single IP (If using the common form of L4/L7 switch)

The former can be handled using a subnet mask in the persistance settings, the latter is handled by normal persistance.

*However* In the case of our proxy farm neither of these would work since we have 2 subnet ranges for our systems - 194.83.240/24 & 194.82.103/24, and an end user request may come out of each subnet totally defeating the persistance idea... (in fact dependent on our clients configuration of their caches, the request could appear to come from the above two subnets or the above 2 subnets and about 1000 other ones as well)

Unfortunately this problem is more common that might be obvious, due to the NLANR hierarchy, so whilst persistance on IP/subnet solves a large number of problems, it can't solve all of them.

7.6 Changing weights with ipvsadm

When setting up a service, you set the weight with a command like (default for -w is 1).

ipvsadm -a -t $VIP:$SERVICE -r $REALSERVER_NAME:$SERVICE $FORWARDING -w 1

If you set the weight for the service to "0", then no new connections will be made to that service (see also man ipvsadm, about the -w option).

Lars Marowsky-Bree lmb@suse.de 11 May 2001

Setting weight = 0 means that no further connections will be assigned to the machine, but current ones remain established. This allows to smoothly take a real server out of service, ie for maintenance.
Removing the server hard cuts all active connections. This is the correct response to a monitoring failure, so that clients receive immediate notice that the server they are connected to died so they can reconnect.

Laurent Lefoll Laurent.Lefoll@mobileway.com 11 May 2001

Is there a way to clear some entries in the ipvs tables ? If a server reboots or crashes, the connection entries remains in the ipvsadm table. Is there a way to remove manually some entries? I have tried to remove the real server from the service (with ipvsadm -d .... ), but the entries are still there.
Joe
After a service (or real-server) failure, some agent external to LVS will run ipvsadm to delete the entry for the service. Once this is done no new connections can be made to that service, but the entries are kept in the table till they timeout. (If the service is still up, you can delete the entries and then re-add the service and the client will not have been disconnected). You can't "remove" those entries, you can only change the timeout values.
Any clients connected through those entries to the failed service(s) will find their connection hung or deranged in some way. We can't do anything about that. The client will have to disconnect and make a new connection. For http where the client makes a new connection almost every page fetch, this is not a problem. Someone connected to a database may find their screen has frozen.

If you are going to set the weight of a connection, you need to first know the state of the LVS. If the service is not already in the ipvsadm table, you add (-a) it. If the service is already in the ipvsadm table, you edit (-a) it. There is no command to just set the weight no matter what the state. A patch exists to do this (from Horms) but Wensong doesn't want to include it. Scripts which dynamically add, delete or change weights on services will have to know the state of the LVS before making any changes, or else trap errors from running the wrong command.

7.7 experimental scheduling code

This section is a bit out of date now. See the ipvsadm new schedulers by Thomas Prouell for web caches and by Henrik Norstrom for firewalls. Ratz ratz@tac.ch has produced a scheduler which will keep activity on a particular real-server below a fixed level.

For this next code write to Ty or grab the code off the list server

Date: Wed, 23 Feb 2000 From: Ty Beede tybeede@metrolist.net

This is a hack to the ip_vs_wlc.c schedualing algorithm. It is curently implemnted in a quick, ad hoc fashion. It's purpose is to support limiting the total number of connections to a real server. Currently it is implmented using the weigh value as the upper limit on the number of activeconns(connections in an established TCP state). This is a very simple implementation and only took a few minutes after reading through the source. I would like, however, to develop it further.

Due to it's simple nature it will not function in several types of enviroments, those based on connectionless protocals (UDP, this uses the inactconns variable to keep track of things, simply change the activeconns varible-in the weigh check- to inactconns for UDP) and it may impose complecations when persistance is implemented. The current algorimthm simply checks that weight > activeconns before including a server in the standard wlc scheduling. This works for my enviroment, but could be changed to perhaps (weight * 50) > (activeconns * 50) + inactconns to include the inactconns but make the activeconns more important in the decison.

Currently the greatest weight value a user may specify is approimalty 65000, independant of this modification. As long as the user keeps most importanly the weight values correct for the total number of connections and in porportion to one another the things should function as expected.

In the event that the cluster is full, all real severs have maxed out, then it might be neccessary for overflow control, or the client's end will hang. I haven't tested this idea but it could simply be implemented by specifing the over flow server last, after the real severs using the ipvsadm tool. This will work because as each real server is added using ipvsadm it is put on a list, with the last one added being last on the list. The scheduling algorithm traverses this list linearly from start to finish and if it finds that all severs are maxed out, then the last one will be the overflow and that will be the only one to send traffic to.

Anyway this is just a little hack, read the code and it should make sense. It has been included as an attachment. If you would like to test this simply replace the old ip_vs_wlc.c scheduling file in /usr/src/linux/net/ipv4 with this one. Compile it in and set the weight on the real severs to the max number of connections in an established TCP state or modifiy the source to your liking.

Date: Mon, 28 Feb 2000 From: Ty Beede tybeede@metrolist.net

I wrote a little patch and posted it a few days ago... I indicated that overflow might be accomplished by adding the overflow server to the lvs last. This statement is completely off the wall wrong. I'm not really sure why I thought that would work but it won't, first of all the linked list adds each new instance of a real sever to the start of the real servers list, not the end like I though. Also it would be impossible do distingish the overflow server from the real servers in the case that not all the real-servers were busy. I don't know where I got that idea from but I'm going to blame it on my "bushy eyed youth". In responce to needing overflow support I'm thinking about implementing "prority groups" into the lvs code. This would logically group the real severs into different groups, though with a higher priority group would fillup before those with a lower grouping. If anybody could comment on this it would be nice to hear what the rest of you think about overflow code.

Theoretical issues in developing better scheduling algorithms

 (Julian)
 > > It seems to me it would be useful in some cases to use the total number
 > > of connections to a real server in the load balancing calculation, in
 > > the case where the real server participates in servicing a number of
 > > different VIPs.
 > >
 (Wensong)
 > Yeah, it is true. Sometimes, we need tradeoff between
 > simplicity/performance and functionality. Let me think more about
 > this, and probably maximum connection scheduling together together
 > too. For a rather big server cluster, there may be a dedicated load
 > balancer for web traffic and another load balancer for mail traffic,
 > then the two load balancers may need exchange status periodically, it
 > is rather complicated.
 
 Yes, if a real server is used from two or more directors
 the "lc" method is useless.
 
 > Actually, I just thought that dynamic weight adaption according to
 > periodical load feedback of each server might solve all the above
 > problems.

(Joe - this is part of a greater problem with LVS, we don't have good monitoring tools and we don't have a lot of information on the varying loads that real-servers have, in order to develope strategies for informed load regulation)

(Julian) From my experience with real servers for web, the only useful parameters for the real server load are:

 
         - cpu idle time
 
                 If you use real servers with equal CPUs (MHz)
                 the cpu idle time in percents can be used.
                 In other cases the MHz must be included in
                 a expression for the weight.
 
         - free ram
 
                 According to the web load the right expression
                 must be used including the cpu idle time
                 and the free ram.
 
         - free swap
 
                 Very bad if the web is swapping.

The easiest parameter to get, the Load Average is always < 5. So, it can't be used for weights in this case. May be for SMTP ? The sendmail guys use only the load average in sendmail when evaluating the load :)

So, the monitoring software must send these parameters to all directors. But even now each of the directors use these weights to create connections proportionally. So, it is useful these parameters for the load to be updated in short intervals and they must be averaged for this period. It is very bad to use current value for a parameter to evaluate the weight in the director. For example, it is very useful to use something like "Average value for the cpu idle time for the last 10 seconds" and to broadcast this value to the director on each 10 seconds. If the cpu idle time is 0, the free ram must be used. It depends on which resource zeroed first: the cpu idle time or the free ram. The weight must be changed slightly :)

The "*lc" algorithms help for simple setups, eg. with one director and for some of the services, eg http, https. It is difficult even for ftp and smtp to use these schedulers. When the requests are very different, the only valid information is the load in the real server.

Other useful parameter is the network traffic (ftp). But again, all these parameters must be used from the director to build the weight using a complex expression.

I think the complex weight for the real server based on connection number (lc) is not useful due to the different load from each of the services. May be for the "wlc" scheduling method ? I know that the users want LVS to do everything but the load balancing is very complex job. If you handle web traffic you can be happy with any of the current scheduling methods. I didn't tried to balance ftp traffic but I don't expect much help from *lc methods. The real server can be loaded, for example, if you build new Linux kernel while the server is in the cluster :) Very easy way to switch to swap mode if your load is near 100%.

Next Previous Contents