LVS-HOWTO: Fwmarks

The original method for settting up an LVS was to use the VIP as the target for ipvsadm commands. A more flexible method, of using firewall marks (fwmarks) was introduced by Horms in Apr 2000. Ted Pavlic then showed how to group arbitary services using fwmarks, which had previously been possible to a limited extent with LVS persistence. The fwmarks method is more flexible and simpler to administer.

fwmark is used

to group services together within a single LVS e.g.
- group 1 - port 80,443 for an e-commerce site
- group 2 - port 20,21 for an ftp server
group large numbers of IP together

Setting up an LVS on fwmarks rather than the VIP is now the method of choice for anything but a collection of simple one port non-persistent services. Fwmark should be used instead of the VIP when persistence is required or multiport services are involved.

Some history (Horms)

The impeteus origionally came out of a VA Linux Systems Professional Services customer who I was called onsite to help sway towards using LVS. My original proposal and implementation was to allow virtual services based on netmasks. Wensong rejected this because of some potential performance issues.

I distinctly remember working on the original implementation on a train trip from the Blue Mountains to Sydney's Central Station. By the time I had to change trains go to Wynyard the code was working :)
A few days latter Julian came up with the idea of using a fwmark, a feature of the ip_masq code that had been around for a while but wasn't heavily used. Wensong passed this on to me. I wrote the kernel, ipvsadm and ldirectord changes and largely have maintained them ever since. I believe they were included with ipvs-0.9.9.
It is of note that as a part of the work that came out of this customer the -R and -S options to ipvsadm were suggested and implemented by myself. These were released just before the inclusion of the fwmark code.
This customer was also the impetus for putting together what is now known as Ultra Monkey. All in all quite an interesting outcome for a couple of days on site. Pleasingly I believe that the customer in question is using Ultra Monkey with the fwmark support in LVS.

Sample configurations/topologies for fwmarks are at Ultramonkey.

8.2 single port service: telnet with fwmarks

Assuming you already have setup the networks and default gw for the machines in your LVS, here's how you'd setup telnet without fwmarks (ie the "normal" method, using the VIP as the target for ipvsadm commands) on a two real-server LVS.

#make a table for connections to VIP:telnet, with round robin scheduling
#schedule real-server bashfull for connections to VIP:telnet, weight=1, forwarding method=DR
#schedule real-server sneezy for connections to VIP:telnet, weight=1, forwarding method=DR
director:# ipvsadm -A -t VIP:telnet -s rr
director:# ipvsadm -a -t VIP:telnet -r bashfull:telnet -g -w 1
director:# ipvsadm -a -t VIP:telnet -r sneezy:telnet -g -w 1

Here's how to do the same thing with fwmarks. You first mark the packets with ipchains or iptables.

ipchains for 2.2.x director

#flush ipchains tables
#mark with value=1, tcp packets from anywhere, arriving on eth1 (holds the VIP on my setup), 
#with dst_addr=192.168.2.110 (the VIP) for port telnet
#show ipchains tables
director:# ipchains -F
director:# ipchains -A input -p tcp -i eth1 -s 0.0.0.0/0 -d 192.168.2.110/32 --dport telnet -m 1
director:# ipchains -L input         
Chain input (policy ACCEPT):
target     prot opt     source                destination           ports
-          tcp  ------  anywhere             lvs2.mack.net         any ->   telnet

iptables for 2.4.x director

The iptables parameters are taken from an example by Paul Schulz, which I found through google.

First put a mark of value=1 on tcp packets which arrive from anywhere with dst_addr=VIP:telnet (the VIP is on eth1 in my setup).

#flush the mangle table
#in the skb, put mark=1 on all tcp packets arriving on eth1 from anywhere, with dest=VIP:telnet
#output the mangle table, just for a look
director:# iptables -F -t mangle
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport telnet -j MARK --set-mark 1
director:/etc/lvs# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
MARK       tcp  --  anywhere             lvs2.mack.net       tcp dpt:telnet MARK set 0x1 

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

The fwmark is only associated with the packet while it is in the director skb (socket buffer). The packet which emerges from the director and is forwarded to the real-server is a normal (unmarked) packet. (You can't use the director's fwmark information on the real-server to decide on how to handle the packet.)

install the LVS with ipvsadm

#setup an ipvsadm table for packets with mark=1, schedule them with round robin.
#schedule real-server sneezy for connections with mark=1, forwarding method=DR, weight=1
#schedule real-server bashfull for connections with mark=1, forwarding method=DR, weight=1
director:# ipvsadm -A -f 1 -s rr
director:# ipvsadm -a -f 1 -r sneezy.mack.net:telnet -g -w 1
director:# ipvsadm -a -f 1 -r bashfull.mack.net:telnet -g -w 1

Here's the output of ipvsadm

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.7 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr
  -> bashfull.mack.net:23             Route   1      0          0         
  -> sneezy.mack.net:23               Route   1      0          0

You can now telnet to the VIP. You'll get the expected round robin scheduling of your connections to bashfull and sneezy.

8.3 Grouping services: single group, active ftp(20,21)

The telnet example above could equally well be done using the VIP or a fwmark as the target for ipvsadm commands. The same is true for any one port service, where connections to services are made independantly of each other. Sometimes we need to group services together, e.g. port 20,21 for an ftp server or port 80, 443 for an e-commerce site. The current method for handling this, persistent connections, links all ports on the VIP, and the director will forward connections to all ports, not just the two we are interested in. For security purposes, if persistence is used to group services, then connection requests to the other ports will have to be blocked. Although workable, it's an ugly solution.

For background on how the specifications for fwmarks were set to allow services to be grouped, see Appendix 1 for the initial discussion between Ted and the LVS developers (Horms and Julian), Appendix 2 where Ted let me know that he'd had it working, and Appendix 3 for Ted's announcement to the mailing list.

grouping using VIP and persistence

Here's an example grouping ports 20,21 for ftp. This uses the VIP as the target for ipvsadm commands with persistent connection (the original, VIP way).

#make a table for connections to all ports on VIP with round robin scheduling, persistence timeout=360secs
#schedule real-server bashfull for connections to all ports on VIP, weight=1, forwarding method=DR
#schedule real-server sneezy for connections to all ports on VIP, weight=1, forwarding method=DR
director:# ipvsadm -A -t VIP:0 -s rr -p 360
director:# ipvsadm -a -t VIP:0 -r sneezy.mack.net:0 -g -w 1
director:# ipvsadm -a -t VIP:0 -r bashfull.mack.net:0 -g -w 1

Here's the output of ipvsadm

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.7 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:0 rr persistent 360
  -> sneezy.mack.net:0                Route   1      0          0         
  -> bashfull.mack.net:0              Route   1      0          0

After the client has made the initial connection on port 21, then any subsequent connection on port 20 (within the 360sec timeout period) will go to the same real-server.

The problem is that the director will forward to the same real-server, connection requests made to any port by the client. If we have port 80 and 443 on the real-server, then these services will be linked to each other (which we may want), and they will also be linked to the ftp service (which we may not want). If you telnet to the VIP, this request will be forwarded to the real-servers too (in production you'll have to block this).

grouping with fwmarks

Here's how to setup an ftp server with fwmarks. First mark the packets of interest with ipchains or iptables (i.emark all tcp packets to VIP:ftp and VIP:ftp-data arriving on eth1).

ipchains for 2.2 director

#flush ipchains tables
#mark ftp packets
#put the same mark on ftp-data packets
#show ipchains tables
director:# ipchains -F
director:# ipchains -A input -p tcp -i eth1 -s 0.0.0.0/0 -d 192.168.2.110/32 --dport ftp -m 1
director:# ipchains -A input -p tcp -i eth1 -s 0.0.0.0/0 -d 192.168.2.110/32 --dport ftp-data -m 1 
director:# ipchains -L input
Chain input (policy ACCEPT):
target     prot opt     source                destination           ports
-          tcp  ------  anywhere             lvs2.mack.net         any ->   ftp
-          tcp  ------  anywhere             lvs2.mack.net         any ->   ftp-data

iptables for 2.4 director

#clear mangle table
#mark ftp packets
#put the same mark on ftp-data packets
#show mangle table
director:# iptables -F -t mangle
director:/etc/lvs# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport ftp -j MARK --set-mark 1
director:/etc/lvs# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport ftp-data -j MARK --set-mark 1
director:/etc/lvs# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp MARK set 0x1 
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp-data MARK set 0x1 

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

install LVS with ipvsadm

Next setup ipvsadm to schedule packets marked with fwmark=1 to your real-servers. You need persistence (here timeout set to 600secs).

director:# ipvsadm -A -f 1 -s rr -p 600
director:# ipvsadm -a -f 1 -r sneezy.mack.net:0 -g -w 1
director:# ipvsadm -a -f 1 -r bashfull.mack.net:0 -g -w 1

Here's the output of ipvsadm with two current connections to the LVS and 3 expiring ones. Note they are all to the same real-server, as expected for a persistent connection. Since forwarding is by VS-NAT, the ip_vs_ftp module automatically loads.

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.7 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> bashfull.mack.net:0              Route   1      2          3         
  -> sneezy.mack.net:0                Route   1      0          0

A netpipe test showed the same latency and throughput for a connection based on fwmark or based on VIP.

What happens now when you telnet from the client to the VIP? (pause to let you think.) The director is only forwarding packets with fwmark=1 to the LVS, so a telnet request to the VIP is accepted by the director and not forwarded to the real-servers. If telnetd is running on the director, you'll get a login prompt from the director. In production you'll have to block this too (just like you had to when setting up on a VIP).

So what's the difference, you ask, between setting up an ftp server with persistence on the VIP on one hand (which requires you to block all other packets with iptables rules), and grouping 20,21 with fwmarks on the other (which requires exactly the same blocking of unwanted packets)? Not a lot. At the moment you're at least even

Lars Marowsky-Brée lmb@suse.de 2000-05-11

When using the LVS box as a firewall/router, the fwmark technique is a perfectly adequate solution, which doesn't cost anything.

But look at the next example.

8.4 Grouping services: two groups, active ftp(20,21) and e-commerce(80,443)

Setup 2 groups of services, group 1 - ftp(20,21), group 2 - ecommerce(80,443).

First mark packets in 2 groups.

ipchains for 2.2 director

director:# ipchains -F
director:# ipchains -A input -p tcp -i eth1 -s 0.0.0.0/0 -d 192.168.2.110/32 --dport ftp -m 1
director:# ipchains -A input -p tcp -i eth1 -s 0.0.0.0/0 -d 192.168.2.110/32 --dport ftp-data -m 1
director:# ipchains -A input -p tcp -i eth1 -s 0.0.0.0/0 -d 192.168.2.110/32 --dport http -m 2
director:# ipchains -A input -p tcp -i eth1 -s 0.0.0.0/0 -d 192.168.2.110/32 --dport https -m 2
director:# ipchains -L input
Chain input (policy ACCEPT):
target     prot opt     source                destination           ports
-          tcp  ------  anywhere             lvs2.mack.net         any ->   ftp
-          tcp  ------  anywhere             lvs2.mack.net         any ->   ftp-data
-          tcp  ------  anywhere             lvs2.mack.net         any ->   www
-          tcp  ------  anywhere             lvs2.mack.net         any ->   https

iptables for 2.4 director

director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport ftp -j MARK --set-mark 1
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport ftp-data -j MARK --set-mark 1
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport http -j MARK --set-mark 2
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport https -j MARK --set-mark 2
director:/etc/lvs# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp MARK set 0x1 
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp-data MARK set 0x1 
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:www MARK set 0x2 
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:https MARK set 0x2 

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

setup LVS to schedule (with persistence) 2 groups of packets

Note: The ipvs code in Apr 2001 needed a patch to get the expected behaviour. This section describes the function of LVS before and after patching. As a result of these tests, the patch will be applied to future releases. ipvs-1.0.7-2.2.19 is already patched (Apr 2001). The 2.4.3 series are not patched yet. To see if the code has been patched look in ipvs/Changelog for something like this

Julian changed persistent connection template for fwmark-based service from <CIP,VIP,RIP> to <CIP,FWMARK,RIP>, so that different fwmark-based services that share the same VIP can work correctly.

If your ipvs code is pre-patched, then you can skip down to the part where the behaviour after applying the patch is described. If your code isn't patched, you should just go get the patch and skip to the part where the expected behaviour is described.

Otherwise here's what happened with the original code.

director:# ipvsadm -A -f 1 -s rr -p 600
director:# ipvsadm -a -f 1 -r sneezy.mack.net:0 -g -w 1
director:# ipvsadm -a -f 1 -r bashfull.mack.net:0 -g -w 1
director:# ipvsadm -A -f 2 -s rr -p 600
director:# ipvsadm -a -f 2 -r sneezy.mack.net:0 -g -w 1
director:# ipvsadm -a -f 2 -r bashfull.mack.net:0 -g -w 1

IP Virtual Server version 0.2.7 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> bashfull.mack.net:0              Route   1      0          0         
  -> sneezy.mack.net:0                Route   1      0          0         
FWM  2 rr persistent 600
  -> bashfull.mack.net:0              Route   1      0          0         
  -> sneezy.mack.net:0                Route   1      0          0

If you ftp and http to the VIP, you'd expect the ftp connections to go to fwmark 1 (presumably to the first real-server bashfull) and the http connections to go to fwmark 2 (again presumably to bashfull).

With the director running 1.0.6-2.2.19 (ipvs/kernel version), all connections (ftp, http) go to group 1. With the director 0.2.7-2.4.2, all connections go to group 2. Here's the output from ipvsadm for the 2.2.19 example immediately after downloading a webpage. You would expect the http InActConn to be associated with FWM2.

IP Virtual Server version 1.0.6 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
FWM  1 rr persistent 30
  -> bashfull.mack.net:0            Route   1      0          2         
  -> sneezy.mack.net:0              Route   1      0          0         
FWM  2 rr persistent 30
  -> bashfull.mack.net:0            Route   1      0          0         
  -> sneezy.mack.net:0              Route   1      0          0         
director:/etc/lvs#

It appears (Apr 2001) that the ipvs code doesn't really follow the persistent fwmarks spec. When there is a collision between VIP space and fwmark space (eg in these examples, where all packets are going to the same VIP), then the VIP takes precedence and the two fwmark groups are not differentiated. The collision arises because there is only one set of templates for the connection tables.

The code to produce the expected behaviour requires a separate set of templates for fwmarks and VIP. The patch to do this is on Julian's patch page and has names like persistent-fwmark-0.2.8-2.4-1.diff, persistent-fwmark-1.0.5-2.2.18-1.diff. (Note: the 0.2.8 patch had DOS carriage control and wouldn't patch till I removed the ^M characters). (Note: as of ipvs-0.9.0, this patch has been applied to the source tree.)

After patching the ip_vs code to produce the new ip_vs.o module (rmmod the old one first), you get the expected fwmark behaviour. Here's the output of ipvsadm after ftp'ing and http'ing from a client. Note that the ftp connection is to fwmark=1. The InActConn is the expiring connection from the http client to fwmark=2.

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 30
  -> bashfull.mack.net:0              Route   1      1          0         
  -> sneezy.mack.net:0                Route   1      0          0         
FWM  2 rr persistent 30
  -> bashfull.mack.net:0              Route   1      0          1         
  -> sneezy.mack.net:0                Route   1      0          0

The original idea from Ted Pavlic

Ted Pavlic tpavlic@netwalk.com 2000-10-08

Just another persistence option that you may or may not have thought of... LVS does support port-group sticky persistance. Before FWMARK support was added to LVS, the only types of persistance one could do were:

One port persistence (all queries to 80 return to the same real server per CIP)
ALL port persistence (all queries to all ports return to the same RIP per CIP)

But now that FWMARK support exists in LVS, it is easy to create group-based sticky persistence. That is... It adds the option where:

Only these two ports (443 and 80) return to the same RIP per CIP
Meanwhile, another persistence table keeps track of 20, 21, and 1024:65535
Any other port is not persistent

Just have ipchains keep track of flagging the incoming packets with the correct port group identifier:

ipchains -A input -D VIPNET/VIPMASK PORT -p PROTOCOL -m FWMARK

And have IPVS stop looking at IPs and start look at FWMARKs:

ipvsadm -A -f FWMARK
ipvsadm -a -f FWMARK -r RIP:0

ssl and cookies

Ted Pavlic tpavlic@netwalk.com 2000-10-13

LVS DIRECTLY supports two types of persistence and INDIRECTLY supports another. If you are just asking how to make port 443 persistent so that those who receive a cookie on 443 will come back to the same real server on 443, simply:

/sbin/ipvsadm -A -t 192.168.1.110:443 -p
/sbin/ipvsadm -a -t 192.168.1.110:443 -R 192.168.2.1
/sbin/ipvsadm -a -t 192.168.1.110:443 -R 192.168.2.2
/sbin/ipvsadm -a -t 192.168.1.110:443 -R 192.168.2.3
...

Will setup persistence just for port 443.

However, say someone gets a cookie on port 80 and gives it back on port 443 -- in that case you want to have persistence between multiple ports. Using port 0 accomplishes this:

/sbin/ipvsadm -A -t 192.168.1.110:0 -p
/sbin/ipvsadm -a -t 192.168.1.110:0 -R 192.168.2.1
/sbin/ipvsadm -a -t 192.168.1.110:0 -R 192.168.2.2
/sbin/ipvsadm -a -t 192.168.1.110:0 -R 192.168.2.3
...

In this setup, anyone who visits ANY service will continue to go back to the same real server. So requests which come in on 80 or 443 will continue to come in to the same real server regardless of port.

This is an OK solution, but it basically makes all services persistent which might mess up scheduling. That is, this is a decent solution but sometimes not extremely desirable.

If you want to simply group ports 80 and 443 together, you need to do something more intuitive. Use FWMARK...

ipchains -A input -d 192.168.1.110/32 80 -p tcp -m 1
ipchains -A input -d 192.168.1.110/32 443 -p tcp -m 1
/sbin/ipvsadm -A -f 1 -p
/sbin/ipvsadm -a -f 1 -R 192.168.2.1
/sbin/ipvsadm -a -f 1 -R 192.168.2.2
/sbin/ipvsadm -a -f 1 -R 192.168.2.3
...

Now only port 80 and 443 will be grouped together via persistence. Any other ipvsadm rules will be completely separate. This means that you can make 80 and 443 persistence by their own little "port group" and leave ports 25 and 110 (for example) not persistent. OR... You could group all the FTP ports together as well on a completely different persistence group... i.e.

ipchains -A input -d 192.168.1.110/32 80 -p tcp -m 1
ipchains -A input -d 192.168.1.110/32 443 -p tcp -m 1
/sbin/ipvsadm -A -f 1 -p
/sbin/ipvsadm -a -f 1 -R 192.168.2.1
/sbin/ipvsadm -a -f 1 -R 192.168.2.2
/sbin/ipvsadm -a -f 1 -R 192.168.2.3
# Really adding port 20 isn't needed
ipchains -A input -d 192.168.1.110/32 20 -p tcp -m 2
ipchains -A input -d 192.168.1.110/32 21 -p tcp -m 2
ipchains -A input -d 192.168.1.110/32 1024:65535 -p tcp -m 2
/sbin/ipvsadm -A -f 2 -p
/sbin/ipvsadm -a -f 2 -R 192.168.2.1
/sbin/ipvsadm -a -f 2 -R 192.168.2.2
/sbin/ipvsadm -a -f 2 -R 192.168.2.3
...

and again

Wayne wrote

Is there a easy way to relating server in both port 80 and port 443 (with VS-NAT)?

Say I have two farms, each with same three servers. One farm load balancing HTTP requests and another farm load balancing HTTPS farms. To make sure the user in the persistent mode connected to the HTTP server always go to the same server for HTTPS service, we would like to have some way to relate the services between the two farms, is there a easy way to do it?

ratz ratz@tac.ch 2001-01-03

Two possibilities to solve this with LVS

Use port 0 in your setup. (adv.: easy to set up and easy understand)
Use fwmark and group them together. (adv.: finer port granularity possible)

Example (1):

ipvsadm -A -t 192.168.1.100:0 -s wlc -p 333 -M 255.255.255.255
ipvsadm -a -t 192.168.1.100:0 -r 192.168.1.1 -g -w 1
ipvsadm -a -t 192.168.1.100:0 -r 192.168.1.2 -g -w 1

Example (2):

ipchains -A input -j ACCEPT -p tcp -d 192.168.1.100/32 80 -m 1 -l
ipchains -A input -j ACCEPT -p tcp -d 192.168.1.100/32 443 -m 1 -l
ipvsadm -A -f 1 -s wlc -p 333 -M 255.255.255.255
ipvsadm -a -f 1 -r 192.168.1.1 -g -w 1
ipvsadm -a -f 1 -r 192.168.1.2 -g -w 1

8.5 passive ftp

You can setup passive ftp with the VIP as the target using persistence. This is not a particular satisfactory solution, as connect requests to all ports will be forwarded. As well, if another service on the real-server fails (eg http), then all services have to be failed out together.

Here's a solution to passive ftp from Ted Pavlic using fwmark. This allows setting up passive ftp independantly of other services. Passive ftp listens on an unknown and unpredictable high port on real-server. This is handled by forwarding requests to all high ports (it's still ugly, but at least this way, we can fail out ftp independently of other services).

test session with active ftp

Here's ftp setup in active mode, as a control.

director:# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp MARK set 0x1 
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp-data MARK set 0x1 
#
#setup ipvsadm, making all packets with mark=1 persistent
director:# ipvsadm -A -f 1 -s rr -p 600
director:# ipvsadm -a -f 1 -r sneezy:0 -g -w 1
director:# ipvsadm -a -f 1 -r bashfull:0 -g -w 1
director:# ipvsadm 
IP Virtual Server version 0.2.8 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> bashfull.mack.net:0              Route   1      0          0         
  -> sneezy.mack.net:0                Route   1      0          0

Here's netstat -an on the client and the real-server (bashfull) immediately after an ftp file transfer (with the client still connected).

#client:
client:~# netstat -an | grep 110 #110 is part of the VIP
tcp        0      0 client:1176   VIP:21           ESTABLISHED 
#real-server
bashfull:/home/ftp/pub# netstat -an | grep 254 #254 is part of the client IP
tcp        0      0 VIP:20        client:1180      TIME_WAIT   
tcp        0      0 VIP:20        client:1178      TIME_WAIT   
tcp        0      0 VIP:20        client:1177      TIME_WAIT   
tcp        0      0 VIP:21        client:1176      ESTABLISHED

Only port 20,21 are involved here.

Here's the command line at the client during the active ftp transfer (all expected output).

ftp> get tulip.c
local: tulip.c remote: tulip.c
200 PORT command successful.
150 Opening BINARY mode data connection for tulip.c (104241 bytes).
226 Transfer complete.
104241 bytes received in 0.0232 secs (4.4e+03 Kbytes/sec)

The iptables rules on the director do not allow passive ftp connection. To test this put the ftp client into passive mode.

ftp> pass
Passive mode on.
ftp> dir
227 Entering Passive Mode (192,168,2,110,4,72)
ftp: connect: Connection refused
ftp>

connection is not allowed. To check that the system is still functioning, put the client back into active mode.

ftp> pass
Passive mode off.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for /bin/ls.
total 155178
.
.
-rw-r--r--   1 root     root       104241 Nov 10  1999 tulip.c
226 Transfer complete.
ftp>

test session with passive ftp

Here's the setup for passive ftp (2.4.x director) (you can leave ipvsadm untouched).

director:# iptables -F -t mangle
#mark ftp packets
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport ftp -j MARK --set-mark 1
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport 1024: -j MARK --set-mark 1
director:# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp MARK set 0x1 
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpts:1024:65535 MARK set 0x1

Here's the command line from the ftp client still in active mode

ftp>  dir
200 PORT command successful.

The session is hung, the server shows an established connection to port 21 and the client session has to be killed.

Here's the passive session.

client:~# ftp VIP
Connected to VIP.
220 bashfull.mack.net FTP server (Version wu-2.4.2-academ[BETA-15](1) Wed May 20
 13:45:04 CDT 1998) ready.
Name (VIP:root): ftp
331 Guest login ok, send your complete e-mail address as password.
Password:
230 Guest login ok, access restrictions apply.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> pass
Passive mode on.
ftp> cd pub
250 CWD command successful.
ftp> dir *.c
227 Entering Passive Mode (192,168,2,110,4,75)
150 Opening ASCII mode data connection for /bin/ls.
-rw-r--r--   1 root     root       104241 Nov 10  1999 tulip.c
226 Transfer complete.
ftp> mget *.c
mget tulip.c? y
227 Entering Passive Mode (192,168,2,110,4,78)
150 Opening BINARY mode data connection for tulip.c (104241 bytes).
226 Transfer complete.
104241 bytes received in 0.0233 secs (4.4e+03 Kbytes/sec)
ftp>

Here's the connections at the real-server immediately after the file transfer. There is the regular connection at the ftp port (21) and a connection timing out to a high port on the real-server.

bashfull:/home/ftp/pub# netstat -an | grep 254 #254 is part of the client IP
tcp        0      0 VIP:1104      client:1191      TIME_WAIT   
tcp        0      0 VIP:21        client:1184      ESTABLISHED

Here's the output from ipvsadm after connecting to the URL ftp://vip/ using a web-browser

director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> bashfull.mack.net:0              Route   1      1          5         
  -> sneezy.mack.net:0                Route   1      0          0

LVS with 2 groups: group 1 = ftp(active and passive), group 2 = http

#fwmark rules
director:# iptables -F -t mangle
#active and passive ftp in group 1
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport ftp -j MARK --set-mark 1
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport ftp-data -j MARK --set-mark 1
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport 1024: -j MARK --set-mark 1
#http as group 2
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport http -j MARK --set-mark 2
director:/etc/lvs# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp MARK set 0x1 
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp-data MARK set 0x1 
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpts:1024:65535 MARK set 0x1 
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:www MARK set 0x2 
#
#setup LVS for 2 groups
director:# ipvsadm -C
#ftp (active and passive) are persistent as group 1 
director:# ipvsadm -A -f 1 -s rr -p 600
director:# ipvsadm -a -f 1 -r sneezy:0 -g -w 1
director:# ipvsadm -a -f 1 -r bashfull:0 -g -w 1
#http as group 2 (not persistent)
director:# ipvsadm -A -f 2 -s rr       
director:# ipvsadm -a -f 2 -r sneezy:http -g -w 1
director:# ipvsadm -a -f 2 -r bashfull:http -g -w 1
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> bashfull.mack.net:0              Route   1      0          0         
  -> sneezy.mack.net:0                Route   1      0          0         
FWM  2 rr
  -> sneezy.mack.net:80               Route   1      0          0         
  -> bashfull.mack.net:80             Route   1      0          0

The client connected (in order) ftp://VIP/, http://VIP/ (passive ftp) and then by active (command line) ftp to VIP. Here's the ipvsadm output.

director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> bashfull.mack.net:0              Route   1      2          3         
  -> sneezy.mack.net:0                Route   1      0          0         
FWM  2 rr
  -> sneezy.mack.net:80               Route   1      2          2         
  -> bashfull.mack.net:80             Route   1      4          0

Here's the connections showing on the real-server. The most recent ones are at the top of the list. The connection list shows (from the bottom, i.e. in the order of connection), passive ftp, http, and active ftp.

bashfull:/home/ftp/pub# netstat -an | grep 254 #254 is part of the CIP
tcp        0      0 VIP:21        client:1207      ESTABLISHED 
tcp        0      0 VIP:80        client:1206      FIN_WAIT2   
tcp        0      0 VIP:80        client:1204      FIN_WAIT2   
tcp        0      0 VIP:1108      client:1202      TIME_WAIT   
tcp        0      0 VIP:21        client:1201      ESTABLISHED

The whole point of this setup is to make ftp and http, which belonged to one persistence group when setup on a VIP, into two groups. Now you can bring the httpd and the ftpd up and down independantly (if you want to fail them out, to change the configuration or software).

8.6 fwmark with VS-NAT

(based on a posting by Horms on 14 Jul 2000)

Here we setup a VS-NAT LVS on a 2.4.x director. (Note: With 2.4 LVS, the masquerading is setup by the ipvs code, i.e. you don't have to masquerade the packets back from the real-servers). These examples assume that the VIP is on eth1 and your network is already setup (ie the real-servers are using the director as the default gw etc).

Mark packets for the VIP and setup the LVS for telnet. (Warning: this first example is not going to get you anything you want.)

 
#
#mark packets
director:# iptables -F -t mangle
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 -j MARK --set-mark 1
director:# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
MARK       tcp  --  anywhere             lvs2.mack.net      MARK set 0x1 
#
#Setup ipvsadm
director:# ipvsadm -C
director:# ipvsadm -A -f 1 -s rr
director:# ipvsadm -a -f 1 -r sneezy.mack.net:telnet -m -w 1
director:# ipvsadm -a -f 1 -r bashfull.mack.net:telnet -m -w 1
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr
  -> bashfull.mack.net:23             Masq    1      0          0         
  -> sneezy.mack.net:23               Masq    1      0          0

You can connect with telnet to the VIP and you'll be forwarded to both real-servers in the expected way.

All packets from the client will be marked and processed by the ipvsadm rules. What happens if you attempt to connect to VIP:80 (pause to think)?

Here's the answer.

client:~# telnet VIP 80
Trying 192.168.2.110...
Connected to lvs2.mack.net.
Escape character is '^]'.

Welcome to Linux 2.2.19.


bashfull login: root
Linux 2.2.19.
Last login: Fri Apr 13 11:43:52 on ttyp1 from client2.mack.net.
No mail.

If you connect to VIP:80 with a browser, it sits there showing the watch symbol for quite a while.

What happened? The explanation is that you told the director to mark all packets (i.e. from any port) from the client, rewrite them to have dest_addr=RIP:telnet and forward the rewritten packets to the real-server. So when you telnet'ed to VIP:80, the packets were forwarded to RIP:23.

Just to make sure that I'd interpretted this correctly, here's the first packets seen by tcpdump running on the client and the real-server during the connect attempts. (These are from different sessions, so the ports on the client are different each time.)

client: here the client is connecting to VIP:80 (lvs2.www)

12:09:44.449566 client2.1118 > lvs2.www: S 2887976275:2887976275(0) win 5840 <mss 1460,sackOK,timestamp 118456418[|tcp]> (DF) [tos 0x10]
12:09:44.450453 lvs2.www > client2.1118: S 1441372470:1441372470(0) ack 2887976276 win 32120 <mss 1460,sackOK,timestamp 117741798[|tcp]> (DF)
12:09:44.450579 client2.1118 > lvs2.www: . ack 1 win 5840 <nop,nop,timestamp 118456418 117741798> (DF) [tos 0x10]

real-server (bashfull): here the real-server is receiving packets to the RIP:23 (bashfull.telnet)

11:44:28.319675 client2.1116 > bashfull.telnet: S 2722509719:2722509719(0) win 5840 <mss 1460,sackOK,timestamp 118440378[|tcp]> (DF) [tos 0x10]
11:44:28.319974 bashfull.telnet > client2.1116: S 1283414485:1283414485(0) ack 2722509720 win 32120 <mss 1460,sackOK,timestamp 117725760[|tcp]> (DF)
11:44:28.320681 client2.1116 > bashfull.telnet: . ack 1 win 5840 <nop,nop,timestamp 118440378 117725760> (DF) [tos 0x10]

If you want only telnet requests to be forwarded to the real-servers, you should mark only packets for VIP:telnet. If you want both telnet and http forwarded then you should give them each their own mark. Here's how to setup VS-NAT with fwmark for both telnet and http.

director:# iptables -F -t mangle
#telnet packets to the VIP get fwmark=1
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport telnet -j MARK --set-mark 1
#http packets to the VIP get fwmark=2
director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport http -j MARK --set-mark 2
director:# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:telnet MARK set 0x1 
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:www MARK set 0x2 
#
#setup ipvsadm
director:# ipvsadm -C
#forward packets with mark=1 to the telnet port
director:# ipvsadm -A -f 1 -s rr
director:# ipvsadm -a -f 1 -r sneezy.mack.net:telnet -m -w 1
director:# ipvsadm -a -f 1 -r bashfull.mack.net:telnet -m -w 1
#forward packets with mark=2 to the httpd port
director:# ipvsadm -A -f 2 -s rr
director:# ipvsadm -a -f 2 -r sneezy.mack.net:http -m -w 1
director:# ipvsadm -a -f 2 -r bashfull.mack.net:http -m -w 1
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr
  -> bashfull.mack.net:23             Masq    1      0          0         
  -> sneezy.mack.net:23               Masq    1      0          0         
FWM  2 rr
  -> bashfull.mack.net:80             Masq    1      0          0         
  -> sneezy.mack.net:80               Masq    1      0          0

Here's the (expected) output of ipvsadm showing the client with 2 telnet sessions and having just downloaded a webpage from the LVS.

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr
  -> bashfull.mack.net:23             Masq    1      1          0         
  -> sneezy.mack.net:23               Masq    1      1          0         
FWM  2 rr
  -> bashfull.mack.net:80             Masq    1      0          1         
  -> sneezy.mack.net:80               Masq    1      0          0

8.7 Collisions between fwmark and VIP rules

Since it's possible to write iptables rules that include many different types of packets, it's possible to write VIP and fwmark rules that would conflict by accepting the same packet. Here's a setup that would accept telnet by both VIP and fwmarks.

director:# iptables -t mangle -A PREROUTING -i eth1 -p tcp -s 0.0.0.0/0 -d 192.168.2.110/32 --dport ftp -j MARK --set-mark 1
director:# ipvsadm -A -t lvs2.mack.net:telnet -s rr
director:# ipvsadm -a -t lvs2.mack.net:telnet -r sneezy.mack.net:telnet -g -w 1
director:# ipvsadm -a -t lvs2.mack.net:telnet -r bashfull.mack.net:telnet -g -w 1
director:# ipvsadm -A -f 1 -s rr
director:# ipvsadm -a -f 1 -r sneezy.mack.net:telnet -g -w 1
director:# ipvsadm -a -f 1 -r bashfull.mack.net:telnet -g -w 1
#
director:# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
MARK       tcp  --  anywhere             lvs2.mack.net      tcp dpt:ftp MARK set 0x1 
#
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:telnet rr
  -> bashfull.mack.net:telnet         Route   1      0          0         
  -> sneezy.mack.net:telnet           Route   1      0          0         
FWM  1 rr
  -> bashfull.mack.net:telnet         Route   1      0          0         
  -> sneezy.mack.net:telnet           Route   1      0          0

Here's the ipvsadm output after 4 telnet connections from a client

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:telnet rr
  -> bashfull.mack.net:telnet         Route   1      2          0         
  -> sneezy.mack.net:telnet           Route   1      2          0         
FWM  1 rr
  -> bashfull.mack.net:telnet         Route   1      0          0         
  -> sneezy.mack.net:telnet           Route   1      0          0

All connections go to the first (here VIP) entries. The same ipvsadm table and connection pattern results if you feed the VIP and fwmarks rules into ipvsadm in the reverse order. This behaviour is not part of the spec (yet). You might want to check the behaviour, if you are doing this sort of setup.

8.8 persistence granularity with fwmark

introduction

Persistence granularity was added to LVS by Lars lmb@suse.de 1999-10-13

This patch adds netmasks to persistent ports, so you can adjust the granularity of the templates. It should help solve the problems created with non-persistent cache clusters on the client side."

The problem being addressed is that some clients (eg AOL customers) connect to the internet via large proxy farms. The IP they present to the server will not neccessarily be the same for different sessions, even though they remain connected to their proxy machine. Persistence granularity makes all clients from a network equivalent as far as persistence is concerned. Thus a client could appear as CIP=x.x.x.13 for their http connections, but CIP=x.x.x.14 for their https connections. With persistence granularity set to /24, all CIPs from the same class C network will be sent to the same real-server. The default behaviour is for all connections from the same CIP to be sent to the one real-server but other connections from the same network will be scheduled to other real-servers (i.e. the default persistence granularity is /32).

Persistence granularity is applied to the CIP and works the same whether you are using fwmark or the VIP to setup the LVS.

You set the netmask (granularity) for persistence granularity with ipvsadm. If the LVS was setup with the following command, the persistence granularity is 255.255.255.0.

ipvsadm -A -t 192.168.1.100:0 -s wlc -p 333 -M 255.255.255.0

Let's say a client from a class C network (e.g. with IP=100.100.100.2) connects to the LVS. If any other client connects from 100.100.100.0/24 they will also connect to the same real-server as long as the original client's entry in the persistence table has not expired (i.e. the first client is still connected, or disconnected < 333 secs ago).

examples

Here's an example VS-DR LVS set to mark packets for an IP on the outside of the director (this IP serves as the VIP in the usual LVS setup, but there's no such thing as a VIP with fwmarks) with --dport telnet. Persistence granularity is set to the default (-M 255.255.255.255).

director:# ipvsadm -C
director:# ipvsadm -A -f 1 -s rr -p 600
director:# ipvsadm -a -f 1 -r sneezy:0 -g -w 1
director:# ipvsadm -a -f 1 -r bashfull:0 -g -w 1

Two clients (192.168.2.254, 192.168.2.253) connect to the LVS. Each host connects to different real-servers but multiple connects from each client go to the same real-server (i.e. client A always goes to real-server A; client B always goes to real-server B, at least till the persistence timeout clears). Here both clients have connected twice.

director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> bashfull.mack.net:0              Route   1      2          0         
  -> sneezy.mack.net:0                Route   1      2          0

This is the connection pattern expected if the connections were based on the CIP/32 and fwmark (ie all clients are scheduled independently).

Here's the same setup with persistence granularity set to /24.

director:# ipvsadm -C 
director:# ipvsadm -A -f 1 -s rr -p 600 -M 255.255.255.0
director:# ipvsadm -a -f 1 -r sneezy:0 -g -w 1
director:# ipvsadm -a -f 1 -r bashfull:0 -g -w 1
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600 mask 255.255.255.0
  -> bashfull.mack.net:0              Route   1      0          0         
  -> sneezy.mack.net:0                Route   1      0          0

Here's what happens when the 2 clients, both of who belong to the same CIP/24 persistence group, connect twice - all connections go to the same real-server.

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600 mask 255.255.255.0
  -> bashfull.mack.net:0              Route   1      4          0         
  -> sneezy.mack.net:0                Route   1      0          0

Discussion with Julian about persistence granularity

Joe

I expect if you were using persistence with fwmark, then any connection requests arriving with the same fwmark will be treated as belonging to that persistence group. Presumably any combination of client IPs and/or networks could have been used to make the rules which marks the packets.

Julian

Yes, it is for the same group but in one fwmark group there are many templates created. These templates are different for the client groups. The template looks like this:

CIPNET:0 -> SERVICE(FWMARK/VIP):0 -> RIP:0

All ports 0 for the fwmark-based services

So, for client 10.1.2.3/24 (24=persistent granularity) the template looks like this:

10.1.2.0:0 -> VIP:0 -> RIP:0

LVS patched with the persistent_fwmark patch:

10.1.2.0:0 -> FWMARK:0 -> RIP:0

So, the templates are created with CIP/GRAN in mind and the lookup uses CIPNET too. We use CIPNET = CIP & CNETMASK before creation and lookup.

so if I did
iptables -s 10.1.2.3 -m 1
ipvsadm -A -f 1 -s rr -p 600 -M 255.255.255.0
only packets from 10.1.2.3 will have a fwmark on them, but the director would forward all packets from 10.1.2.0/24, even those without fwmarks?

The patched LVS will accept only the marked packets for this fwmark service, from the same /24 client subnet. If only one client IP sends packets that are marked then the real service will receive packets only from 10.1.2.3. The current LVS versions don't consider the service and all packets CIPNET -> VIP will be forwarded using the first created template for CIPNET:0->VIP:0, i.e. these packets will randomly hit one of the many services that accept packets for the same VIP (just like in your setup) and then may be a wrong real server.

The current LVS versions don't consider the service and all packets CIPNET -> VIP

but there is no VIP here, I'm using fwmark only. what does the -M 255.255.255.0 do in this case?

The current LVS versions (i.e. without the persistent_fwmark patch) assume the VIP is the iphdr->daddr, i.e. the destination address in the datagram and this addresses is used to lookup/create the template.

how about your persistent-patch, which I've been working with?

The patch ignores this daddr when creating or looking for templates. Instead, the service fwmark values is used when the service is fwmark-based: CIPNET:0 -> FWMARK:0 -> RIP:0

The normal services use daddr as VIP when looking for or creating templates: CIPNET:0 -> daddr:0 -> RIP:0

The persistence is associated with the client address (CIP). The sequence is this:

- packet comes from CIP to VIP1

- fw marking, optional

- lookup for existing connection CIP:CPORT->VIP1:VPORT, if yes => forward, if not found:

- lookup service => fwmark 1, persistent

- try to select real service in context of the virtual service

Apply the persistence granularity to the client address CIPNET = CIP & svc->netmask

Now lookup for template

not patched: check for existing template CIPNET:0, VIP1:0
patched: check for existing template CIPNET:0, 1(fwmark):0

if there is template, bind the new connection to the template's destination

if there is no existing template, get one destination using the scheduler and bind it to the newly created template and the new connection. The created template is

CIPNET:0, VIP1:0, DEST_RIP:0
CIPNET:0, 1(fwmark):0, DEST_RIP:0

- forward the packet

Persistence granularity was designed for people coming in from large proxy servers (eg AOL). With fwmarks, this can be handled by iptables rules.

Yes, the fact that we group the clients using this netmask is not related to the virtual service type: normal or fwmark-based.

Yes, each different IP is treated as different client. When a netmask < 32 is used, the group of addresses is treated as one client when applying the persistence rules. This is not related to the packet marking and virtual service type.

8.9 fwmark allows VS-DR director to be default gw for real-servers

If a VS-DR director is accepting packets by fwmarks, then it does not have a VIP. The director can then be the default gw for the real-servers (see VS-DR director is default gw for real-servers).

8.10 Routing to director and real-servers in an LVS setup with fwmark

If you are using fwmark as the target for the LVS, the destination addresses could be any arbitary grouping of IPs. Since there is no interface on the director with any of these IPs, then some method is needed so that packets destined for the LVS get to the director. This same problem occurs with transparent proxy. The solution is to configure the router to send all those packets to the director.

Once the packets have been forwarded from the director, in the case of VS-DR some method is needed for the real-server to accept the packets. With VS-DR there is no routing problem; the director sends the packets to the interface with the MAC address of the RIP. However the real-server must recognise that the packets are to be processed locally. Several methods are available.

The real-server can accept the packets by transparent proxy
an alias can be set to accept a network of addresses, without assigning an IP to the device Horms horms@vergenet.net 10 Apr 2000
```
ifconfig lo:0 192.168.1.0 netmask 255.255.255.0 
```
You can now ping anything in the 192.168.1.0/24 network from the console. Note: You can't ping any of those IP's from a remote host (after adding a route on the remote host to this network). If you put this network onto an eth0 alias (rather than an lo alias), it doesn't reply to pings from the console - presumably the ping replies in the lo:0 case are coming from 127.0.0.1. For another example of routing to an interface without an IP, see routing to real-servers from director in VS-DR.

8.11 fwmark simplifies configuration for large numbers of addresses

If a fwmark rule accepts packets for a /24 network, then 254 IPs are configured in one instruction. The next sections are examples.

8.12 Example: firewall farm

Horms horms@vergenet.net 2000-12-06

Assume that packets from out local network (192.168.0.0/23) are outgoing traffic.

Mark all outgoing packets with fwmark 1

ipchains -A input  -s 192.168.0.0/23 -m 1
# Now, set up a virtual service to act on the marked packets
ipvsadm -A -f 1
ipvsadm -a -f 1 -r 192.168.1.7
ipvsadm -a -f 1 -r 192.168.1.8
ipvsadm -a -f 1 -r 192.168.1.9

Where 192.168.1.7, 192.168.1.8 and 192.168.1.9 are your firewall boxen.

8.13 Example: LVS'ing a CIDR block

Matthew S. Crocker wrote:

would like to put a CIDR block of addresses (/25) through my LVS server. Is there a way I can set one entry for a VIP range and then the load balancing will be handled over the entire range.

Horms horms@vergenet.net 2001-01-13

Set up fwmark rules on the input chain to match incoming packets for the CIDR and mark them with a fwmark.

e.g.

ipchains -A input -d 192.168.192.0/24 -m 1

Use the fwmark (1 in this case) as the virtual service.

ipvsadm -A -f 1
ipvsadm -a -f 1 -r 10.0.0.1
ipvsadm -a -f 1 -r 10.0.0.2

8.14 Example: forwarding based on client source IP

client A (from 192.x.x.x) should go to real-server 1..3, and client B (from 10.x.x.x) should go to real-server 4..6.

(Julian, 10-05-2000)

Write fwmark rules based on the source IP of the packets. Then create two virtual services, one for each fwmark.

8.15 Example: load balancing multiple class C networks

Ian Courtney wrote:

Basically here at our ISP, we tend to have 2-3 Class C's worth of hosting per server. We would like to move the the LVS, but I'm not exactly sure how I should be setting it up.

Chris chris@isg.de 2001-01-15

You can use the fwmark option for the loadbalancing

#mark the incoming packets with ipchains
ipchains -A input -s 0.0.0.0/0 -d 192.168.0.0/24 -m 1
#then you can setup your LVS like
ipvsadm -A -f 1 -s wlc
ipvsadm -a -f 1 -r 10.10.10.15 -g
ipvsadm -a -f 1 -r 10.10.10.16 -g

the router should point to the director.

Ian Courtney wrote back:

It didn't work until I aliased all 3 class C's to my director. Do I have to do this?

Julian Anastasov ja@ssi.bg 2001-01-16

Yes, only the packets destined for local addresses/networks are accepted. The others are dropped or forwarded to another box.

the next project involves redoing our standard linux web space, which so far consists of about 8 webservers, each hosting atleast 2 class C's worth of hosting. I some how don't think Linux will take nicely to have 16 or more class C's aliased to it.

If possible use netmask <24. I assume you execute (replace with the right Class C nets):

ifconfig lo:1 207.228.79.0  netmask 255.255.254.0
ifconfig lo:2 207.148.155.0 netmask 255.255.255.0
ifconfig lo:3 207.148.151.0 netmask 255.255.255.0

on the director and on each real server and solve the arp problem using:

echo 1 > /proc/sys/net/ipv4/conf/all/hidden
echo 1 > /proc/sys/net/ipv4/conf/lo/hidden

in the real servers. If you don't want to advertise these addresses using ARP to the Cisco LAN, you can execute the above two commands in the director too.

8.16 Example: proxy server

Thomas Proell, 16 Aug 2000

How do you use fwmark if you want the director to accept packets for a wide range of addresses, for which is doesn't have IPs.

(Horms)

Here's a setup I used...

                                               Internet
                                                  |                             
                                                Router 192.168.128.1
"client"                  Linux Director          | 
  va2-------------------------va3-----------------+--------- proxy (va4)
192.168.16.3      192.168.16.1   192.168.128.2        192.168.128.5

I have used 192.168/16, but these could be real addresses too. I have only put one proxy server in the diagram but I did test it with 2

Client: default gw va3 (192.168.16.1)

Linux Director:
eth0: 192.168.128.2             (internet/proxy side)
eth1:  192.168.16.1             (client side)
Default gw: Router ,192.168.128.1
IPV4 forwarding enabled.

Ipvsadm rules - these can be translated into ldirectord configuration.
ipvsadm -A -f 1 -s wlc
ipvsadm -a -f 1 -r 192.168.128.3:0 -g -w 1
... add additonal proxy servers

Interestingly enough if you add a proxy that just forwads traffic then it will end up going direct. This may be useful as a failback server if the proxy servers fail.

ipchains -A input -s 0.0.0.0/0.0.0.0 -d 127.0.0.1/255.255.255.255 -j ACCEPT
ipchains -A input -s 0.0.0.0/0.0.0.0 -d 192.168.128.2/255.255.255.255 -j ACCEPT
ipchains -A input -s 0.0.0.0/0.0.0.0 -d 192.168.16.2/255.255.255.255 -j ACCEPT
ipchains -A input -s 0.0.0.0/0.0.0.0 -d 0.0.0.0/0.0.0.0 80 -p tcp -j REDIRECT 80 -m 1

The -m 1 means that IPVS will regognise packets patched by this filter as belonging to the virtual service as long as it sees the packets as local. -j REDIRECT 80 makes the packets appear as local. It is of note that the port you redirect to is _ignored_ because of the way IPVS works - paickets using fwmark are sent to the port they arrived on. This means that packets will be sent to proxy servers as port 80 traffic.

Proxy:
eth0: 192.168.128.5
Default gw: 192.168.128.1 (router)
IPV4 forwarding enabled.

ipchains -A input -s 0.0.0.0/0.0.0.0 -d 127.0.0.1/255.255.255.255 -j ACCEPT
ipchains -A input -s 0.0.0.0/0.0.0.0 -d 192.168.128.5/255.255.255.255 -j ACCEPT
ipchains -A input -s 0.0.0.0/0.0.0.0 -d 0.0.0.0/0.0.0.0 80:80 -p 6 -j REDIRECT +8080

Note, this is where the redirection to port 8080 takes place.

8.17 Example: transparent web cache

On Mon, May 08, 2000 at 11:18:08AM +0700, Pongsit@yahoo.samart.co.th wrote:

> If i would like to use LVS to balance 3 transparent proxy is this how i do it ?
>
>
>                 Internet
>                    |
>                    |
> ------------------------------------------- hub 1
>          |          |           |
>          |eth0      |           |           proxy1 ,2 and 3 set as a
>        proxy1     proxy2      proxy3        transparent proxy with firewall
>          |eth1      |           |           where eth0 connect to internet
>          |          |           |           and eth1 to the internal network
> ___________________________________________
>             |          |     |     |    |    hub 2
>             |          |     |     |    |
>          LVS/DR       client machines   |
>                                         |
>                                         |
> ___________________________________________  hub 3 if i have more internel
>                                                   users
>

Horms horms@vergenet.net 2000-05-08

If you want to do transparent proxying then I would suggest a topology more along the lines of:

                 Internet
                    |
                    |
------------------------------------------------ hub 1
                    |
                    |
                 LVS/DR
                    |
                    |
________________________________________________
   |      |      |      |     |     |    |    hub 2
   |      |      |      |     |     |    |
 proxy1 proxy2 proxy3  client machines   |
                                         |
                                         |
_________________________________________________ 
                                               hub 3 if i have more internel users

Use IP chains mark all outgoing port 80 traffic, other than from the 3 proxy servers with firewall mark 1 (ipchains -m 1...).

Set up a IPVS virtual service matching of fwmark 1 (ipvsadm -A -f 1...).

The proxy servers will need to be set up to recognise all port 80 traffic forwarded to them as local.

This way all outgoing traffic hits the LVS box. If it is for port 80 and isn't from one of the proxy servers then it gets load balanced and forwarded to one of the proxy servers.

You may want to consider a hot standby LVS/DR host to eliminate a single point of failure on your network.

I haven't tested this but I think it should work.

8.18 Example: Dynamically generated images in webpages

One of the assumptions of setting up an LVS is that the content presented on the real-servers is identical. This is required because the client can be sent to any of the real-servers. This requirement is not handled if the client fills in a form which produces a gif on the real-server.

Alois Treindl alois@astro.ch 30 Apr 2001

If a page is created by a CGI and contains dynamically created GIFs, the requests for these gifs will land on a different realserver than the one where the cgi runs. Will I need persistence?

I am running an astrology site; a typical request is to a CGI which creates an astrological drawing, based on some form data; this drawing is stored as a temporary GIF file on the server. A html page is output by the CGI which contains a reference to this GIF.
The browser receives the html, and then requests the GIF file from the server. It will mostly hit a different server than the one who created the GIF.
So either we make sure that the new client request for the GIF hits the same realserver which ran the CGI (i.e. have persistence) or we must create the GIF on a shared directory, so that each realserver sees it.
I have not tested it yet (not ported the CGIs yet to the new LVS box) but I think things are not so simple. In a 'rr' scheduling configuration, for example, the scheduler could play dirty, depending on the number of http requests for the given page, and the number of realservers. Both could be incommensurable in a way that the http request for the GIF never reaches the same realserver as the one which ran the CGI request.
I had already decided that I need shared directories between all real servers for our CGI environment which does computationally expensive things all the time. Some CGIs create also data files which are used by later CGIs. It is either shared directories for such files, or a shared database (which we also use).
These temp files will be sitting in the RAM cache of the NFS server, so that only network bandwidth between the realservers and the NFS server is the limiting factor. This is why I give the NFS server 2 gb of RAM, the max it will physically take, and this is why I chose 2.2.19 as the kernel because it contains NFS-3, which is said to be faster than NFS-2.

(Joe)

I tested it here on a page which generates a gif for the client. I found that I could never get the gif. Presumably after downloading the page containing the reference to the gif, the round robin scheduler sends the request for the gif to another real-server.

Presumably even page counters will have this problem. Writing to a shared directory should work.

Here's a solution with persistent fwmark using ip_tables to setup on a 2.4.x kernel. (Note: for page counters, this method will increment for each real-server, and not for the total page count over all the real-servers as would happen with a shared directory.)

#put fwmark=1 on all tcp packets for VIP:http arriving on eth0
director:# iptables -t mangle -A PREROUTING -i eth0 -p tcp -s 0.0.0.0/0 -d 192.168.1.110/32 --dport http -j MARK --set-mark 1
#setup a 2 real-server LVS to persistently forward packets with fwmark=1 using rr scheduling.
director:# ipvsadm -A -f 1 -s rr -p 600
director:# -a -f 1 -r sneezy.mack.net:0 -g -w 1
director:# -a -f 1 -r bashfull.mack.net:0 -g -w 1
#output setup
director:# iptables -L -t mangle
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
MARK       tcp  --  anywhere             lvs.mack.net       tcp dpt:http MARK set 0x1 
director:# ipvsadm
IP Virtual Server version 0.2.11 (size=16384)                  
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> bashfull.mack.net:0            Route   1      0          0         
  -> sneezy.mack.net:0              Route   1      0          0

Here's the output of ipvsadm after the successful generation and display of the dynamically generated gif. Note all connections went to one real-server.

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.11 (size=16384)                  
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> bashfull.mack.net:0            Route   1      5          3         
  -> sneezy.mack.net:0              Route   1      0          0

8.19 Appendix 1: Specificiations for grouping of services with fwmarks

Here are the discussions that has resulted in the current specifications for handling of persistence with fwmarks in LVS.

Ted Pavlic Jul 14, 2000

What I was asking about would be something like this:
virtual=192.168.6.2-192.168.6.30:80
         real=192.168.6.240:80 gate
         service=http
         request="index.html"
         receive="Test Page"
         scheduler=rr
I have 1029 virtual servers -- that is I have 1029 hosts which need to be load balanced.

Horms horms@vergenet.net 2000-07-14

(fwmark) has the advantage of simplfying the amount of _kernel_ configuration that has to be done which is a big win, even if this is automated by a user space application. The basic idea is that this provides a means for LVS to have virtual services that have more than one host/port/protocol triplet. In your situation this means that you can have a single virtual service that handles many virtual IP addresses and all ports and protocols (UDP, TCP and

You should take a look at ultramonkey examples (note from Joe, UM is now 1.0.2, look for examples there). My understanding is that this is quite similar to how your LVS topology will be set up, though I understand you will be having more than one of these configured.

Basically what happens is that you set up LVS to consider any packets like other LVS virtual services other than that no VIP is specified.

e.g.

ipvsadm -A -f 1 -s rr
ipvsadm -a -f 1 -r 192.168.6.3:80 -m
ipvsadm -a -f 1 -r 192.168.6.2:80 -m
ipvsadm -L -n
IP Virtual Server version 0.9.11 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
FWM  1 rr
  -> 192.168.6.3:80              Masq    1      0          0
  -> 192.168.6.2:80              Masq    1      0          0

The other half of the equation is that ipchains is used to match incoming traffic for virtual IP addresses and mark them with fwmark 1. Say you have 8 contiguous class C's of virtual addresses beginning at 192.168.0.0/24. The ipchains command to set up matching of these packets would be:

ipchains -A input -d 192.168.0.0/21 -m 1

You also need to set up a silent interface so that the LVS box sees traffic for the VIPs as local. To do this use:

ifconfig lo:0 192.168.0.0 netmask 255.255.248.0 mtu 1500
echo 1 > /proc/sys/net/ipv4/conf/all/hidden
echo 1 > /proc/sys/net/ipv4/conf/lo/hidden

Now, as long as 192.168.0.0/21 is routed to the LVS box, or more particularly the floating IP address of the LVS box brought up by heartbeat, traffic for the VIPs will be routed to the LVS box, the ipchains rules will mark it with fwmark 1 and LVS will see this fwmark and consider the traffic as destined for a virtual service.

Ted Jul 14, 2000

for me to enable persistent connections to every port using direct routing, would this work?
ipvsadm -A -f 1 -s rr -p 1800
ipvsadm -a -f 1 -r 216.69.192.201:0 -g
ipvsadm -a -f 1 -r 216.69.192.202:0 -g

Horms

Yes, that would work. The port in the "ipvsadm -a" commands is ignored if the real servers are being added to a fwmark service. Connections will be sent to the port on the real server that they will be recieved on the virtual server. So port 80 traffic will go to port 80, port 443 traffic will go to port 443 etc...

As a caveat you should really make sure that your ipchains statments catch all traffic for the given addresses including ICMP traffic so ICMP traffic is handled correctly by LVS.

(Julian on catching ICMP traffic)

IIRC, this is already not a requirement in the last LVS versions. If we look in skb->fwmark for ICMP packets it is impossible to use normal and fwmark virtual services to same VIP because we can't create such ipchains rules. The good news is that in 2.4 (0.0.3) the virtual service lookup (the fwmark field) is used only for the new connections. In 2.2 the service is looked up even for existing entries but we don't want to break the MASQ code entirely

Ted Pavlic tpavlic@netwalk.com 19 Jul 2000

When using fwmark to assign real servers to virtual servers, how is scheduling and persistence handled?

In my particular example, I have: 216.69.196.0/22 (ie 4 class C networks) all marked with a fwmark of 1. ipvsadm setup is

Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port          Forward Weight ActiveConn InActConn
FWM  1 lc persistent 600
-> nw01:0                      Route   1      0          0
-> nw02:0                      Route   1      0          0

Say someone connects to 216.69.196.1 and the connection is assigned to nw01. At this point ipvsadm shows

Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port          Forward Weight ActiveConn InActConn
FWM  1 lc persistent 600
-> nw01:0                      Route   1      1          0
-> nw02:0                      Route   1      0          0

A new person connects to another IP in 216.69.196.0/22 (say 216.69.196.2). Will this new connection to 216.69.196.2 go to nw02 because it has the least number of TOTAL connections, or will it go to nw01 because for that PARTICULAR IP, both have 0 connections?

Now then say that the person who just connected to 216.69.196.1 makes a connection (within the 600 persistence seconds) to 216.69.196.3. Will this new connection go to nw01 because it's being persistent? Or will it go to either server depending on the number of connections?

Here's what I think would be the best way to do things...

If multiple IPs are marked with FWMARK 1, LVS should consider them all one entry in its active/inactive table. I don't believe that's how things are currently being handled.

(Julian)

The templates are not accounted in the active/inactive counters.

(Joe, almost a year later - Julian, what do you mean here?) (Julian 13 Apr 2001)

Ted here thinks that the templates are accounted in the inactive/active counters. And before the persistent-fwmark patch we can have many templates for one fwmark-based service:
CIPNET:0 -> VIP1:0 -> RIP_A:0
CIPNET:0 -> VIP2:0 -> RIP_B:0
where VIP1 and VIP2 are marked with same fwmark.
Ted recommends these two templates to be replaced with one, i.e. just like in the persistent-fwmark patch:
CIPNET:0 -> FWMARK:0 -> RIP1:0

We can't see the templates (which are normal connection entries with some reserved values in the connections structure fields) accounted in the inactive/active counters. The reason for this is that the inactive/active counters are used to represent the real server load but our templates don't lead to any load in the real servers, we use them only to maintain the persistence.

When a service is marked persistent all connections from CIP to VIP go to same RIP for the specified period. Even for the fwmark based services. This works for many independent VIPs.

The other case is fwmark service covering a DNS name. I expect comments from users with SSL problems and persistent fwmark service. Is there a problem or may be not?

I agree, may be the both cases can be useful:

1. CIP->VIP
2. CIP->FWMARK

Any examples where/why (2) is needed?

But switching the LVS code always to use (2) for the persistent fwmark services is possible.

(Ted)

In my opinion, here are some pros and cons of case 2:

Pros:

Improves scheduling, I think, and true load balancing. If someone is using [W]RR or [W]LC, the LVS box will actually look at the real servers as a whole rather than separate real server entries for EACH VIP. Does that make sense?

For example, in my particular configuration I have over one thousand VIPs which are load balanced onto four RIPs. When I configure the LVS server to use LC scheduling, I'd like it to look at how many TOTAL connections are being made to each RIP not how many connections are being made to each RIP PER VIP. I would like to load balance all one thousand VIPs as a WHOLE onto the four RIPs rather than load balance EACH VIP.

That is, in some of my less active sites, most of their traffic will probably hit one VIP just because not much traffic will need to be load balanced. However, more active sites will hit both servers. The load will then not be distributed equally among the servers as one server will probably get not only the active traffic but also the less active traffic and the other server will only get the more active traffic (in the case of having two RIPs).

Cons:

One person on the Internet will keep connecting to the same RIP for many different VIPs if persistence is turned on.

If this causes a problem, the LVS administrator can do one of two different things:

1) Rather than load balancing a fwmark template, go back to load balancing specific VIPs. The scheduling will then be unique for those particular VIPs.

2) Create multiple fwmark templates. The scheduling for each template will be unique.

In my opinion if you group a bunch of IPs together by marking them with an fwmark, that you say that you want to load balance all of those COLLECTIVELY -- almost like load balancing one site.

I'm just saying, are there any examples where CIP->FWMARK is not needed?

As far as the LVS is concerned, if someone connects to a VIP marked with fwmark 1, it should treat it just like every other VIP marked with fwmark 1 -- as if they were all one VIP.

But today on my LVS (where I have a ten minute persistence setup) I connected to one virtual server marked with fwmark 1 and got a certain real server. I then expected to connect to another virtual server also marked with fwmark 1 and get that same real server. I did not, however. If what you're telling me is correct, the persistence should have connected me to the same real server as long as I was connecting within that ten minute window.

Now in this particular example -- connecting to DIFFERENT virtual servers -- it isn't so necessary for persistence to be carried through PER virtual server. I'm just worried that least connection scheduling and round-robin scheduling aren't working at the fwmark level -- I'm worried that they are working at the VIP level as if I had setup hundreds of explicit VIP rules inside IPVSADM.

(Julian) > In my opinion, here are some pros and cons of case 2:

I hope this feature (2) will be implemented in the next LVS version (if Wensong don't see any problems). I.e. the templates can be changed to case (2) for the persistent fwmark services. For now we (I and Horms) don't see any problems after this change. Then connections from one client IP to different VIPs (from the same fwmark service) will go to the same real server (only for the persistent fwmark services).

(Julian)

> Do you see any reason why enabling CIP->FWMARK for all cases would be a bad
> thing?
> 
> That is, not only using case 2 for persistent fwmark, but just whenever
> fwmark was used. Personally, I cannot ever forsee a scenerio when a person
> would setup an fwmark for load balancing and want each VIP associated with
> that fwmark to act independently. <?>

Web cluster for independent domains (VIPs). fwmark service is used only to reduce the amount of work for configuration.

> I've always thought that the scheduling algorithms should look directly at
> the real servers rather than the real server stats for each particular
> virtual server. That is, least connection scheduling would look at the total
> number of connections on a real server, not just the connections from that
> particular VIP. Round-robin would go round-robin from real server to real
> server based on the last connection from ANY VIP to the real servers...
> However, before fwmark I realized that this would probably very difficult to
> do especially in cases where an LVS administrator was load balancing to a
> number of different real server clusters that may overlap.

This is a job for the user space tools: WRR scheduling method + weights derived from the real server load. Yes, one real server can be loaded from:

many directors
many virtual services
other processes not part from the real service

In this case the director's opinion (for each virtual service) about the real server load is wrong. The only way to handle such case properly is to use WRR method. In the other cases WLC, LC and RR can do their job.

fwmark, to me, just by causing all VIPs marked with a particular fwmark to look like one big VIP makes it possible to do basically that which I just described. I don't see why anyone would not want such functionality with the fwmark services. If one did want such functionality, he would probably partition the VIPs associated with his fwmark into separate fwmarks or even explicit VIP entries anyway.

Yes. IMO, this can be a problem only for the balancing but I don't think so. The problems will come when one real server dies and the client can't access any VIP part from the fwmark service for a period of time.

8.20 Appendix 2: Demonstration of grouping services with fwmarks

Here's the original e-mail between Ted tpavlic@netwalk.com 3 Aug 2000 and Joe

One of the things it fwmarks lets me do is make ports sticky by groups.

Basically I setup ipchains rules that say all packets to ports 80/tcp and 443/tcp mark with a 1. All packets to ports 20/tcp and 21/tcp as well as 1024:65535/tcp mark with a 2. Voila... I just made ports stick by groups.

I then go into IPVS and setup my real servers under FWMARK1 and FWMARK2. Ports 80 and 443 are now persistent as a group just as 20 and 21 and 1024:65535 are persistent as a group. If my HTTP goes down on one of my real servers, I do not have to take my FTP down as well. I only have to remove the real server from the FWMARK1 group. It's great!

Joe

> most people don't program their own on-line transaction processing program
> and the point of an LVS is for the real-servers to be running the same code 
> as when they're stand alone.

My users run PHP scripts as well as ASPs that keep session information. That session information is unique per server and usually is stored in a local /tmp directory. Users are handed cookies which tie them to their session information. If they go to the wrong real server, that session information won't exist and a number of things could go wrong.

most of my real servers run a lot of services... HTTP, HTTPS, FTP, SMTP, POP3, IMAP, DNS, And when one of them went down (with persistence set up), I would have to take the entire real server down.

Several problems:

*) One little thing goes down... POP3, for example. Now the load increases a great deal on all my other real servers... Perhaps causing the load to become so high that sendmail starts rejecting connections... and then THAT real server also is taken COMPLETELY down... domino effect. If I could have just taken POP3 down off of that server, it would have been perfect.

*) Say something horrible happens causing sendmail to go down on all the servers... or HTTP... or POP3... any one service -- just as long as it goes down on all servers. Rather than just causing that service to be affected, ALL of my services go down because every real server was taken completely off-line until that ONE service is fixed. :(

But I figured that those two problems wouldn't be that big of a deal... I could probably put such a system in production.

Well -- I put such a system in production and those problems weren't that big of a deal... Except for a COUPLE of times when all services went down and caused a BIG hassle. So my superiors wanted something better -- needed in fact.

So at first I came up with the interim idea of separating persistent services and non-persistent services by IP. All of my persistent services were basically on one supernet and all of my non-persistent services were on another subnet. Consequently, I could tie the one supernet to one FWMARK and the other subnet to another FWMARK. Now if a persistent service went down, it would bring down only all of the persistent services. Also, if a non-persistent service went down, it would only bring down all of the non-persistent services.

This was definitely an interim solution because it required a lot more IPs that any one administrator should need, and it still was far from perfect.. BUT... I started to realize that just as I could mark different supernets and subnets with different FWMARKs, I could go farther down the TCP/IP layers and mark things at their protocol and port level. That's where I realized that we COULD do persistents by port group just with a little help from ipchains.

> I asked Horms if there was any point in having multiple fwmarks.
> His only example was if you had duplicate sets of realservers. Eg the
> paying customers get the fast servers, while the people coming into the
> free site get the 486 with 16M.

Similar idea here... except rather than setting up your policies like:

* Paying customers -> fast server
* Free -> slow server

You have:

* SMTP -> a real-server
* POP3 -> another real-server
* HTTP/HTTPS -> yet another real-server
* FTP -> and another real-server

The key of it all is the fact that you can group by about any parameter that ipchains can see. If ipchains can segregate it, you can group it. Anything that ipchains can do IPVS can then add onto itself.

> Have you solved passive ftp without using persistance?

I really don't think there's any way to get around it... In order to get passive FTP to work, you need to make TCP port 21 persistent with every TCP port above 1024. I mean -- how else could you do it without putting some big brother software inside of LVS which would keep an eye on FTP and see what port it tells the end-user to connect to.

Still, putting 21 and 1024:65535 together is a lot better than putting everything together. Personally I only plan on load balancing things in the < 1024 range anyway, so I have no problem including that huge group above 1024.

This is my setup

FWMARK1 => HTTP/HTTPS (persistent)
FWMARK2 => FTP (persistent)
FWMARK3 => SMTP
FWMARK4 => POP3
FWMARK5 => DOMAIN
FWMARK6 => IMAP
FWMARK7 => ICMP (for kicks)

================
IP Virtual Server version 0.9.12 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
FWM  1 lc persistent 600
  -> nw04:0                      Route   1      58         121
  -> nw03:0                      Route   1      49         76
  -> nw02:0                      Route   1      60         98
  -> nw01:0                      Route   1      61         44
FWM  2 lc persistent 600
  -> nw04:0                      Route   1      0          2
  -> nw03:0                      Route   1      0          2
  -> nw02:0                      Route   1      1          13
  -> nw01:0                      Route   1      1          0
FWM  3 lc
  -> nw04:0                      Route   1      4          11
  -> nw03:0                      Route   1      4          12
  -> nw02:0                      Route   1      3          20
  -> nw01:0                      Route   1      3          16
FWM  4 lc
  -> nw04:0                      Route   1      3          54
  -> nw03:0                      Route   1      1          74
  -> nw02:0                      Route   1      3          51
  -> nw01:0                      Route   1      2          73
FWM  5 lc
  -> nw03:0                      Route   1      0          46
  -> nw01:0                      Route   1      0          44
  -> nw02:0                      Route   1      0          45
  -> nw04:0                      Route   1      0          45
FWM  6 lc
  -> nw04:0                      Route   1      0          0
  -> nw03:0                      Route   1      0          0
  -> nw02:0                      Route   1      1          0
  -> nw01:0                      Route   1      0          0
FWM  7 lc
  -> nw04:0                      Route   1      0          0
  -> nw03:0                      Route   1      0          0
  -> nw02:0                      Route   1      0          0
  -> nw01:0                      Route   1      0          0
==============

Is this anything new?

> It's new to me and Horms didn't have any other ideas for multiple
> fwmarks 3 weeks ago, so I expect it will be new to him.

I've been thinking of ways of combining different programs which already exist out there to get L7 scheduling working. For exmaple -- you have some program (sorta like policy routing but one more layer up) that filters packets at the application layer and does something to them... routes them to a particular IP... something like that... and then have ipchains mark each one of those packets with a particular mark... and have LVS work from there.

You see -- using multiple fwmarks makes me think that you can do a lot more with LVS.

We could probably borrow some of the ideas used for some of the dynamic routing protocols, like BGP or RIP. A master could advertise its IPVS hash table. If it didn't advertise within a given interval of time, other LVS's could take over.

During the failover, rather than trading an IP like we were talking about, all LVSs could know which one is the active one and ICMP redirect to that LVS or something like that.

Right now I'm routing every virtual server through the active LVS. This lets me do a lot of nifty things (for me at least):

* Very little has to happen on the LVS during failovers. They basically just trade an IP. In fact -- I COULD do the failover right at the router before the LVS's -- just have it route to another IP.

* I do not have to bring every IP up on my real-servers -- I just have to bring the network that they're on up on a hidden loopback device. When you route an entire network to a loopback device, the loopback device answers every IP on that subnet automatically. So even with 1024+ IPs, I have to setup very few interfaces/aliases because a great deal of them are on the same subnet.

8.21 Appendix 3: Announcement of grouping services with fwmarks

Ted Pavlic tpavlic@netwalk.com 4 Aug 2000

Periodically the issue comes up regarding wanting to do persistence by groups of ports. Until now, an LVS administrator could make a single port persistent or all ports persistent.

Single port persistence was nice for quite a few things. However, things like HTTP and HTTPS caused complications with it. Someone who connected to a webpage on HTTP and started a session tied to them with a cookie would want to return to that same real server when they went to the HTTPS version of that site. FTP would also cause a problem with single port persistence as someone who wanted to use passive FTP wouldn't be gauranteed the same server when they returned on a random TCP port above 1024. There are other examples as well.

So the solution to these problems would be to make every port persistent. This works pretty well, but now anytime a user of a large network behind a firewall would connect to a real server on ANY service, everyone behind that firewall would hit that same real server. Plus, if an administrator wanted to stop scheduling a single service to a single real server, he would have to take all services down on that single real server. This causes many problems as well... especially if one small service dies on every real server -- brings down every service on every real server.

So there has been the need for persistence by port GROUPS. Rather than saying all ports are persistent, it would be nice to tell LVS to tie just 80/tcp and 443/tcp together or just 21/tcp and 1024:65535/tcp together. Before the wonderful FWMARK additions to LVS, this was not possible.

But now that LVS listens to FWMARKs, it becomes possible to group ports together inside ipchains with different FWMARKs and then tell LVS to listen to those FWMARKs.

For example, one can setup a rule inside FWMARK to do this...

80/tcp, 443/tcp --> FWMARK1
21/tcp, 1024:65535/tcp --> FWMARK2
25/tcp --> FWMARK3
110/tcp --> FWMARK4

Then inside LVS (assume on this setup all of these services are served by the same real server cluster), say:

FWMARK1 -> PERSISTENT -> real1,real2,real3,real4
FWMARK2 -> PERSISTENT -> real1, real2, real3, real4
FWMARK3 -> real1, real2, real3, real4
FWMARK4 -> real1, real2, real3, real4

Not only have you now setup persistence by port groups, but you've also split your services back up into autonomous services that will not bring EVERY server down for the sake of persistence. If FTP goes down on real1, real1 only needs to be stopped scheduling for FTP.

another explanation

Ted Pavlic tpavlic@netwalk.com 2000-09-15

Using fwmark, you can setup something which used to be a big desire in LVS, persistence by port groups.

For example... Say you were serving HTTP and HTTPS. In this case, you would probably want calls to one HTTP server to end up hitting the same HTTPS server. This way session information and such would be accessable no matter how the end-user was accessing the website.

Say you also wanted all forms of FTP to work... You would need persistence there, but not necessarily the same persistence as HTTP/HTTPS.

And other protocols do not need to be persistent.

Back in the olden days before fwmark, to do any of this you would have to make ALL ports persistent. You couldn't simply say "Group 80 and 443 together and make them persistent and then make 21, 20, AND 1024:65535 persistent." If one service went down, you would have to bring down ALL services. Some sort of persistence by port groups would allow you to only need to take down whatever went down and the affected server could still serve other services.

FWMARK allows you to do this by way of setting up multiple FWMARKs.

That is -- you can use ipchains to say that:

HTTP,HTTPS --> FWMARK1
FTP --> FWMARK2
SMTP --> FWMARK3
POP --> FWMARK4

Then in LVS, setup:

FWMARK1 --> WLC Persistent 600
FWMARK2 --> WLC Persistent 300
FWMARK3 --> WLC
FWMARK4 --> WLC

And if FTP went down, all you'd have to do is stop scheduling FTP rather than stop scheduling EVERYTHING.

Also note that FWMARK makes setting up MASS VIPs really easy (of course because of recent ARIN policy changes, this probably won't be done much more anymore). That is, if you wanted to load balance 1000 VIPs, it might be easy to setup one single rule in ipchains to cover them all, where it would be 1000 rules for EACH real server in ipvsadm.

It makes me think that if there was a utility already out there that could sit on a director and figure out where name-based packets were going it might be able to mark each name-based host with a different FWMARK and pass that right back to LVS... Then LVS wouldn't have to worry about handling name-based stuff ITSELF. Of course the name-based challenge is even more challenging considering how much data needs to be looked at to figure out if a TCP stream is a name-based HTTP session going to specific name X.... but that's a completely other argument... Just food for thought.

Next Previous Contents