VS-NAT is based on Cisco's LocalDirector.
This method was used for the first LVS. (If you want to set up a test LVS, this is still probably the simplest setup.)
With VS-NAT, the incoming packets are rewritten by the director to have the destination address of one of the real-servers and then forwarded to the real-server. The replies from the real-server are sent to the director, where they are rewritten to have the source address of the VIP.
Unlike the other two methods of forwarding used in an LVS (VS-DR and VS-Tun), with VS-NAT the real-server only needs a functioning tcp/ip stack (eg it can be a networked printer). I.e. the real-server can have any operating system and no modifications are made to the configuration of the real-servers (except setting their route tables).
Here the client is on the same network as the VIP (in a production LVS, the client will be coming in from an external network via a router). The director can have 1 or 2 NICs (two NICs will allow higher throughput of packets, since the traffic on the real-server network will be separated from the traffic on the client network).
Machine                        IP
client                         CIP=192.168.1.254
director VIP                   VIP=192.168.1.110 (the IP for the LVS)
director internal interface    DIP=10.1.1.1
real-server1                   RIP1=10.1.1.2
real-server2                   RIP2=10.1.1.3
real-server3                   RIP3=10.1.1.4
.
.
real-serverN                   RIPn=10.1.1.n+1
director-inside                DIIP=10.1.1.9 (director interface on the VS-NAT network)
                        ________
                       |        |
                       | client |
                       |________|
                   CIP=192.168.1.254
                           |
                        (router)
                           |
                      __________
                     |          |   | VIP=192.168.1.110 (eth0:110)
                     | director |---| DIP=10.1.1.1 (eth0)
                     |__________|   | DIIP=10.1.1.9 (eth0:9)
                           |
                           |
          -----------------------------------
          |                |                |
          |                |                |
   RIP1=10.1.1.2    RIP2=10.1.1.3    RIP3=10.1.1.4 (all eth0)
   _____________    _____________    _____________
  |             |  |             |  |             |
  | real-server |  | real-server |  | real-server |
  |_____________|  |_____________|  |_____________|
Here's the lvs_nat.conf file for this setup:
LVS_TYPE=VS_NAT
INITIAL_STATE=on
VIP=eth0:110 lvs 255.255.255.0 192.168.1.255
DIRECTOR_INSIDEIP=eth0 director_inside 192.168.1.0 255.255.255.0 192.168.1.255
DIRECTOR_DEFAULT_GW=client
SERVICE=t telnet rr real-server1:telnet real-server2:telnet real-server3:telnet
SERVER_NET_DEVICE=eth0
SERVER_DEFAULT_GW=director_inside
#----------end lvs_nat.conf------------------------------------
The VIP is the only IP known to the client. The RIPs are on a different network to the VIP (although with only 1 NIC on the director, the VIP and the RIPs are on the same wire).
In normal NAT, masquerading is the rewriting of packets originating behind the NAT box. With VS-NAT, the incoming packet (src=CIP,dst=VIP, abbreviated to CIP->VIP) is rewritten by the director (becoming CIP->RIP). The action of the LVS director is called demasquerading. The demasqueraded packet is forwarded to the real-server. The reply packet (RIP->CIP) is generated by the real-server.
For VS-NAT to work, the reply packets from the real-servers must return through the director, i.e. the default gw of the real-servers must be the director. Getting this wrong is the single most common cause of problems setting up a VS-NAT LVS.
In a normal server farm, the default gw of the real-server would be the router to the internet and the packet RIP->CIP would be sent directly to the client. In a VS-NAT LVS, the default gw of the real-servers must be the director. The director masquerades the packet from the real-server (rewrites it to VIP->CIP) and the client receives a rewritten packet with the expected source IP of the VIP.
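For example, to point a real-server at the director (a sketch, assuming the DIIP 10.1.1.9 from the diagram above is used as the gateway, as in the lvs_nat.conf above):

realserver:# route add default gw 10.1.1.9
realserver:# netstat -rn     #the 0.0.0.0 route should now point at the director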
Note: the packet must be routed via the director; there must be no other path to the client. A packet arriving at the client directly from the real-server will not be seen as a reply to the client's request and the connection will hang. If the director is not the default gw for the real-servers, then if you use tcpdump on the director to watch an attempt to telnet from the client to the VIP (run tcpdump with `tcpdump port telnet`), you will see the request packet (CIP->VIP), the rewritten packet (CIP->RIP) and the reply packet (RIP->CIP). You will not see the rewritten reply packet (VIP->CIP). (Remember that if you have a switch on the real-server's network, rather than a hub, then each node only sees the packets to/from it; tcpdump won't see packets between other nodes on the same network.)
Part of the setup of VS-NAT then is to make sure that the reply packet goes via the director, where it will be rewritten to have the addresses (VIP->CIP). In some cases (e.g. 1 net VS-NAT) icmp redirects have to be turned off on the director so that the real-server doesn't get a redirect to forward packets directly to the client.
In a production system, a router would prevent a machine on the outside exchanging packets with machines on the RIP network. As well, the real-servers will be on a private network (eg 192.168.x.x/24) and will not be routable.
In a test setup (no router), these safeguards don't exist. All machines (client, director, real-servers) are on the same piece of wire and if routing information is added to the hosts, the client can connect to the real-servers independently of the LVS. This will stop VS-NAT from working (your connection will hang), or it may appear to work (you'll be connecting directly to the real-server).
In a test setup, traceroute from the real-server to the client should go through the director (2 hops). The configure script will test that the director's gw is 2 hops from the real-server and that the route to the director's gw is via the director, hopefully to prevent this type of error.
In production you should _not_ be able to ping from the real-servers to the client. The real-servers should not know about any other network than their own (here 10.1.1.0). The connection from the real-servers to the client is through ipchains (for 2.2.x kernels) and VS-NAT tables setup by the director.
In my first attempt at VS-NAT setup, I had all machines on a 192.168.1.0 network and added a 10.1.1.0 private network for the real-servers/director, without removing the 192.168.1.0 network on the real-servers. All replies from the servers were routed onto the 192.168.1.0 network rather than back through VS-NAT and the client didn't get any packets back.
The VS-NAT setup can have a separate NIC for the DIP and the VIP putting the real-server network and the LAN for the VIP on different wires (the director could be a firewall for the real-servers). This should prevent real-servers routing packets directly to the client (at least it has for me).
Here's the general setup I use for testing. The client (192.168.2.254) connects to the VIP on the director. (The VIP on the real-server is present only for VS-DR and VS-Tun.) For VS-DR, the default gw for the real-servers is 192.168.1.254. For VS-NAT, the default gw for the real-servers is 192.168.1.9.
        ____________
       |            |192.168.1.254 (eth1)
       |  client    |----------------------
       |____________|                      |
   CIP=192.168.2.254 (eth0)                |
              |                            |
              |                            |
   VIP=192.168.2.110 (eth0)                |
        ____________                       |
       |            |                      |
       |  director  |                      |
       |____________|                      |
   DIP=192.168.1.9 (eth1, arps)            |
              |                            |
              |                            |
          (switch)--------------------------
              |
   RIP=192.168.1.2 (eth0)
   VIP=192.168.2.110 (for VS-DR, lo:0, no_arp)
        _____________
       |             |
       | real-server |
       |_____________|
This setup works for both VS-NAT and VS-DR.
Here's the routing table for one of the real-servers as in the VS-NAT setup.
bashfull:# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.0     0.0.0.0         255.255.255.0   U        40 0          0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U        40 0          0 lo
0.0.0.0         192.168.1.9     0.0.0.0         UG       40 0          0 eth0
Here's a traceroute from the real-server to the client showing 2 hops.
traceroute to client2.mack.net (192.168.2.254), 30 hops max, 40 byte packets
 1  director.mack.net (192.168.1.9)  1.089 ms  1.046 ms  0.799 ms
 2  client2.mack.net (192.168.2.254)  1.019 ms  1.16 ms  1.135 ms
icmp redirects are on at the director, but the director doesn't issue a redirect (see icmp_redirects ) because the packet RIP->CIP from the real-server emerges from a different NIC on the director than it arrived on (and with different source IP). The client machine doesn't send a redirect since it is not forwarding packets, it's the endpoint of the connection.
Use lvs_nat.conf as a template (sample here will setup VS-NAT in the diagram above assuming the real-servers are already on the network and using the DIP as the default gw).
#--------------lvs_nat.conf----------------------
LVS_TYPE=VS_NAT
INITIAL_STATE=on
#director setup:
VIP=eth0:110 192.168.1.110 255.255.255.0 192.168.1.255
DIRECTOR_INSIDEIP=eth0:10 10.1.1.10 10.1.1.0 255.255.255.0 10.1.1.255
#Services on real-servers:
#telnet to 10.1.1.2
SERVICE=t telnet wlc 10.1.1.2:telnet
#http to 10.1.1.2 (with weight 2) and to a high port on 10.1.1.3
SERVICE=t 80 wlc 10.1.1.2:http,2 10.1.1.3:8080 10.1.1.4
#real-server setup (nothing to be done for VS-NAT)
#----------end lvs_nat.conf------------------------------------
The output is a commented rc.lvs_nat file. Run the rc.lvs_nat file on the director and then the real-servers (the script knows whether it is running on a director or real-server).
The configure script will set up masquerading and forwarding on the director, and the default gw for the real-servers.
The packets coming in from the client are being demasqueraded by LVS.
In 2.2.x you need to masquerade the replies. Here's the masquerading code in rc.lvs_nat produced by configure.pl
echo "turning on masquerading " #setup masquerading echo "1" >/proc/sys/net/ipv4/ip_forward echo "installing ipchain rules" /sbin/ipchains -A forward -j MASQ -s 10.1.1.2 http -d 0.0.0.0/0 #repeated for each real-server and service .. .. echo "ipchain rules " /sbin/ipchains -L
In this example, http is being masqueraded by the director, allowing the real-server to reply to the http requests from the client, which are being demasqueraded by the director as part of the 2.2.x LVS code.
In 2.4.x, masquerading of LVS'ed services is done explicitly by the LVS code and no extra ipchains commands need be run.
Telnet requests initiated by the real-server will go out through the director (the default gw for the real-server) without any masquerading (or demasquerading). Since the real-servers for a VS-NAT LVS will most likely have private IP's, telnet'ing may not get you very far. If you did want telnet to be masqueraded by the director, so that you can get to the outside world, then on the director you need to run a command like
/sbin/ipchains -A forward -j MASQ -s 10.1.1.2 telnet -d 0.0.0.0/0
Telnet will now be masqueraded out by the director, quite independently of the LVS setup.
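You can check that such outgoing connections are being masqueraded by listing the masquerading table on the director (a sketch):

director:# /sbin/ipchains -M -L -n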
With VS-NAT, the ports can be re-mapped. A request to port 80 on the director can be sent to port 8000 on a real-server. This is possible because the source and destination of the packets are already being rewritten and no extra overhead is required to rewrite the port numbers. The rewriting is slow (60usec/packet on a pentium classic) and limits the throughput of VS-NAT (for 536byte packets, this is 72Mbit/sec or about 100BaseT). While VS-NAT throughput does not scale well with the number of real-servers, the advantage of VS-NAT is that real-servers can have any OS, no modifications are needed to the real-server to run it in an LVS, and the real-server can have services not found on Linux boxes.
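Here's a sketch of the port re-mapping with ipvsadm, using the VIP and first real-server from the diagram above (the addresses and the choice of port 8000 are illustrative):

#requests to VIP:80 go to port 8000 on the real-server; -m selects VS-NAT
/sbin/ipvsadm -A -t 192.168.1.110:80 -s rr
/sbin/ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.2:8000 -m -w 1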
For the earlier versions of VS-NAT (with 2.0.36 kernels) the timeouts were set in linux/include/net/ip_masq.h; the default values of the masquerading timeouts are:
#define MASQUERADE_EXPIRE_TCP     15*16*Hz
#define MASQUERADE_EXPIRE_TCP_FIN  2*16*Hz
#define MASQUERADE_EXPIRE_UDP      5*16*Hz
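On 2.2.x kernels the masquerading timeouts can be changed at run time rather than by recompiling; a sketch (the values here are illustrative, in seconds, for tcp, tcpfin and udp):

director:# /sbin/ipchains -M -S 7200 10 160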
ipvsadm does the following
#setup connection for telnet, using round robin
/sbin/ipvsadm -A -t 192.168.1.110:23 -s rr
#connections to x.x.x.110:telnet are sent to
#real-server 10.1.1.2:telnet
#using VS-NAT (the -m) with weight 1
/sbin/ipvsadm -a -t 192.168.1.110:23 -r 10.1.1.2:23 -m -w 1
#and to real-server 10.1.1.3
#using VS-NAT with weight 2
/sbin/ipvsadm -a -t 192.168.1.110:23 -r 10.1.1.3:23 -m -w 2
(if the service was http, the webserver on the realhost could be listening on port 8000 instead of 80)
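To verify what the director ends up with, you can list the virtual service table (a sketch):

director:# /sbin/ipvsadm -L -n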
Example: client requests a connection to 192.168.1.110:23
director chooses real server 10.1.1.2:23, updates connection tables, then
packet                   source      dest
incoming                 CIP:3456    VIP:23
inbound rewriting        CIP:3456    RIP1:23
reply (routed to DIP)    RIP1:23     CIP:3456
outbound rewriting       VIP:23      CIP:3456
The client gets back a packet with the source_address = VIP.
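If you want to watch the director's connection table while this happens, the entries can be inspected in /proc (a sketch; this path applies to 2.4.x LVS, the location differs on other kernels):

director:# cat /proc/net/ip_vs_conn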
For the verbally oriented...
The request packet is sent to the VIP. The director looks up its tables and sends the connection to real-server1. The packet is rewritten with a new destination (in this case with the same port, but the port could be changed too) and sent to RIP1. The real-server replies, sending back a packet to the client. The default gw for the real-server is the director. The director accepts the packet and rewrites the packet to have source=VIP and sends the rewritten packet to the client.
Why isn't the source of the incoming packet rewritten to be the DIP or VIP?
(Wensong) ...changing the source of the packet to the VIP sounds good too, it doesn't require that default route rule, but requires additional code to handle it.
(with Julian)
In normal NAT, where a bunch of machines are sitting behind a NAT box, all outward going packets are given the IP on the outside of the NAT box. What if there are several IPs facing the outside world? For NAT it doesn't really matter as long as the same IP is used for all packets. The default value is usually the first interface address (eg eth0).
With VS-NAT you want the outgoing packets to have the source of the VIP (probably on eth0:1) rather than the IP on the main device on the director (eth0).
With a single real-server VS-NAT LVS serving telnet, the client will send a packet
CIP:high_port -> VIP:telnet
This will be masqueraded by the director to
CIP:high_port -> RIP:telnet
The real-server generates a reply
RIP:telnet -> CIP:high_port
which arrives on the director (being sent there because the director is the default gw for the real-server).
To get the packet from the director to the client, you have to reverse the masquerading done by the LVS. To do this, on the director you add an ipchains rule
director:# ipchains -A forward -p tcp -j MASQ -s realserver1 telnet -d 0.0.0.0/0
If the director has multiple IPs facing the outside world (eg eth0=192.168.2.1, the regular IP for the director, and eth0:1=192.168.2.110, the VIP), the masquerading code has to choose the correct IP for the outgoing packet. Only the packet with src_addr=VIP will be accepted by the client. A packet with any other src_addr will be dropped. The normal default for masquerading (eth0) should not be used in this case. The required m_addr (masquerade address) is the VIP.
> Does LVS fiddle with the ipchains tables to do this?
No, ipchains only delivers packets to the masquerading code. It doesn't matter how the packets are selected in the ipchains rule.
The m_addr (masqueraded_address) is assigned when the first packet is seen (the connect request from the client to the VIP). LVS sees the first packet in the LOCAL_IN chain when it comes from the client. LVS assigns the VIP as maddr.
The MASQ code sees the first packet in the FORWARD chain when there is a -j MASQ target in the ipchains rule. The routing selects the maddr. If the connection already exists the packets are masqueraded.
The LVS can see packets in the FORWARD chain but they are for already created connections, so no maddr is assigned and the packets are masqueraded with the address saved in the connections structure (the VIP) when it was created.
Here's another version of the same conversation
Julian Anastasov ja@ssi.bg
01 May 2001
There are 3 common cases: (1) a new connection is created when its first packet is seen, (2) a related connection is created in advance (e.g. for ftp), and (3) a packet arrives for an already established connection.
Case (1) can happen in the plain masquerading case where the in->out packets hit the masquerading rule. In this case when nobody recommends the saddr for the packets going to the external side of the MASQ, the masq code uses the routing to select the maddr for this new connection. This address is not always the DIP, it can be the preferred source address for the used route, for example, address from another device.
Case (1) happens also for LVS, but in this case we already know the addresses to use: the client address and the VIP come from the packet, and the real-server is chosen by the scheduler. But that is an out->in packet, and we are talking about in->out packets.
Case (2) happens for related connections where the new connection can be created when all addresses and ports are known or when the protocol requires some wildcard address/port matching, for example, ftp. In this case we expect the first packet for the connection after some period of time.
It seems you are interested in how case (3) works. The answer is that the NAT code remembers all these addresses and ports in a connection structure (the protocol plus the client, masquerade and real-server address:port pairs).
LVS and the masquerading code simply hook into the packet path and perform the header/data mangling. In this process they use the information from the connection table(s). The rule is simple: once a packet belongs to an established connection, we must remember all addresses and ports and always use the same values when mangling the packet header. If we select different addresses or ports each time, we simply break the connection. After the packet is mangled, the routing is called to select the next hop. Of course, you can expect problems if there are fatal route changes.
So, the short answer is: the LVS knows what maddr to use when a packet from the real server is received because the connection is already created and we know what addresses to use. Only in the plain masquerading case (where LVS is not involved) can connections be created and a masquerading address selected without a rule recommending one. In all other cases there is a rule that says what addresses are to be used at creation time. After creation, the same values are used.
The throughput of VS-NAT is limited by the time taken by the director to rewrite a packet. The limit for a pentium classic 75MHz is about 80Mbit/sec (100baseT). Increasing the number of real-servers does not increase the throughput.
The performance page shows a slightly higher latency with VS-NAT compared to VS-DR or VS-Tun, but the same maximum throughput. The load average on the director is high (>5) at maximum throughput, and the keyboard and mouse are quite sluggish. The same director box operating at the same throughput under VS-DR or VS-Tun has no perceptible load as measured by top or by mouse/keyboard responsiveness.
The disadvantage of the 2 network VS-NAT is that the real-servers are not able to connect to machines in the network of the VIP. You couldn't make a VS-NAT setup out of machines already on your LAN, which were also required for other purposes to stay on the LAN network.
Here's a one network VS-NAT LVS.
                        ________
                       |        |
                       | client |
                       |________|
                   CIP=192.168.1.254
                           |
                           |
                      __________
                     |          |   | VIP=192.168.1.110 (eth0:110)
                     | director |---| DIP=192.168.1.1 (eth0)
                     |__________|   | DIIP=192.168.1.9 (eth0:9)
                           |
                           |
         ------------------------------------
         |                  |                  |
         |                  |                  |
  RIP1=192.168.1.2   RIP2=192.168.1.3   RIP3=192.168.1.4 (all eth0)
   _____________      _____________      _____________
  |             |    |             |    |             |
  | real-server |    | real-server |    | real-server |
  |_____________|    |_____________|    |_____________|
The problem:
A return packet from the real-server (with address RIP->CIP) will be sent to the real-server's default gw (the director). ICMP redirects will be sent from the director telling the real-server of the better route directly to the client. The real-server will then send the packet directly to the client and it will not be demasqueraded by the director. The client will get a reply from the RIP rather than the VIP and the connection will hang.
The cure:
In the previous HOWTO I said that initial attempts to handle this by turning off redirects had not worked. The problem appears now to be solved.
Thanks to michael_e_brown@dell.com
and Julian ja@ssi.bg
for help sorting this out.
To get a VS-NAT LVS to work on one network -
1. On the director, turn off icmp redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/default/send_redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects
(Note: eth0 may be eth1 etc, on your machine).
2. Make the director the default and only route for outgoing packets.
You will probably have set the routing on the real-server up like this
realserver:/etc/lvs# netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.0     0.0.0.0         255.255.255.0   U         0 0          0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
0.0.0.0         director        0.0.0.0         UG        0 0          0 eth0
Note the route to 192.168.1.0/24. This allows the real-server to send packets to the client by just putting them out on eth0, where the client will pick them up directly (without being demasqueraded) and the LVS will not work.
Remove the route to 192.168.1.0/24.
realserver:/etc/lvs# route del -net 192.168.1.0 netmask 255.255.255.0 dev eth0
This will leave you with
realserver:/etc/lvs# netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
0.0.0.0         director        0.0.0.0         UG        0 0          0 eth0
The VS-NAT LVS now works. If LVS is forwarding telnet, you can telnet from the client to the VIP and connect to the real-server.
You can ping from the client to the real-server.
You can also connect _directly_ to services on the real-server _NOT_ being forwarded by LVS (in this case e.g. ftp).
You can no longer connect directly to the real-server for services being forwarded by the LVS. (In the example here, telnet ports are not being rewritten by the LVS, ie telnet->telnet).
client:~# telnet realserver
Trying 192.168.1.11...
^C
(i.e. the connection hangs)
Here's tcpdump on the director. Since the network is switched the director can't see packets between the client and real-server. The client initiates telnet. `netstat -a` on the client shows a SYN_SENT from port 4121.
director:/etc/lvs# tcpdump
tcpdump: listening on eth0
16:37:04.655036 realserver.telnet > client.4121: S 354934654:354934654(0) ack 1183118745 win 32120 <mss 1460,sackOK,timestamp 111425176[|tcp]> (DF)
16:37:04.655284 director > realserver: icmp: client tcp port 4121 unreachable [tos 0xc0]
(repeats every second until I kill telnet on client)
I don't see the connect request from client->real-server. The first packet I see is the ack from the real-server, which will be forwarded via the director. The director will rewrite the ack to be from the director. The client will not accept an ack to port 4121 from director:telnet.
(with Julian)
The routes added with the route command go into the kernel FIB (Forwarding Information Base) route table. The contents are displayed with the route (or netstat -r) command.
Following an icmp redirect, the route updates go into the kernel's route cache (route -C).
You can flush the route cache with
echo 1 > /proc/sys/net/ipv4/route/flush
or
ip route flush cache
Here's the route cache on the real-server before any packets are sent.
realserver:/etc/rc.d# route -C
Kernel IP routing cache
Source          Destination     Gateway         Flags Metric Ref    Use Iface
realserver      director        director              0      1        0 eth0
director        realserver      realserver      il    0      0        9 lo
With icmp redirects enabled on the director, repeatedly running traceroute to the client shows the routes changing from 2 hops to 1 hop. This indicates that the real-server has received an icmp redirect packet telling it of a better route to the client.
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  director (192.168.1.9)  0.932 ms  0.562 ms  0.503 ms
 2  client (192.168.1.254)  1.174 ms  0.597 ms  0.571 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  director (192.168.1.9)  0.72 ms  0.581 ms  0.532 ms
 2  client (192.168.1.254)  0.845 ms  0.559 ms  0.5 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  client (192.168.1.254)  0.69 ms  *  0.579 ms
Although the route command shows no change in the FIB, the route cache has changed. (The new route of interest is bracketed by >< signs.)
realserver:/etc/rc.d# route -C
Kernel IP routing cache
Source          Destination     Gateway         Flags Metric Ref    Use Iface
client          realserver      realserver      l     0      0        8 lo
realserver      realserver      realserver      l     0      0     1038 lo
realserver      director        director              0      1      138 eth0
>realserver     client          client                0      0        6 eth0<
director        realserver      realserver      l     0      0        9 lo
director        realserver      realserver      l     0      0      168 lo
Packets to the client now go directly to the client instead of via the director (which you don't want).
It takes about 10mins for the real-server's route cache to expire (experimental result). The timeouts may be in /proc/sys/net/ipv4/route/gc_*, but their location and values are well encrypted in the sources :) (some more info from Alexey is in the LVS archives).
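To look at the route cache garbage collection settings (a sketch):

realserver:# grep . /proc/sys/net/ipv4/route/gc_*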
Here's the route cache after 10mins.
realserver:/etc/rc.d# route -C
Kernel IP routing cache
Source Destination Gateway Flags Metric Ref Use Iface
realserver realserver realserver l 0 0 1049 lo
realserver director director 0 1 139 eth0
director realserver realserver l 0 0 0 lo
director realserver realserver l 0 0 236 lo
There are no routes to the client anymore. Checking with traceroute shows that 2 hops are initially required to get to the client (i.e. the routing cache has reverted to using the director as the route to the client). After 2 iterations, icmp redirects route the packets directly to the client again.
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.908 ms 0.572 ms 0.537 ms
2 client (192.168.1.254) 1.179 ms 0.6 ms 0.577 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.695 ms 0.552 ms 0.492 ms
2 client (192.168.1.254) 0.804 ms 0.55 ms 0.502 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 client (192.168.1.254) 0.686 ms 0.533 ms *
If you now turn off icmp redirects on the director.
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/default/send_redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects
Checking routes on the real-server -
realserver:/etc/lvs# netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
127.0.0.0 0.0.0.0 255.0.0.0 U 0 0 0 lo
0.0.0.0 director 0.0.0.0 UG 0 0 0 eth0
nothing has changed here.
Flush the kernel route cache and display it again -
realserver:/etc/lvs# ip route flush cache
realserver:/etc/lvs# route -C
Kernel IP routing cache
Source Destination Gateway Flags Metric Ref Use Iface
realserver director director 0 1 0 eth0
director realserver realserver l 0 0 1 lo
There are no routes to the client.
Now when you send packets to the client, the route stays via the director, needing 2 hops to get to the client. There are no one-hop packets to the client.
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.951 ms 0.56 ms 0.491 ms
2 client (192.168.1.254) 0.76 ms 0.599 ms 0.574 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.696 ms 0.562 ms 0.583 ms
2 client (192.168.1.254) 0.62 ms 0.603 ms 0.576 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.692 ms * 0.599 ms
2 client (192.168.1.254) 0.667 ms 0.603 ms 0.579 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.689 ms 0.558 ms 0.487 ms
2 client (192.168.1.254) 0.61 ms 0.63 ms 0.567 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.705 ms 0.563 ms 0.526 ms
2 client (192.168.1.254) 0.611 ms 0.595 ms *
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.706 ms 0.558 ms 0.535 ms
2 client (192.168.1.254) 0.614 ms 0.593 ms 0.573 ms
The kernel route cache
realserver:/etc/rc.d# route -C
Kernel IP routing cache
Source Destination Gateway Flags Metric Ref Use Iface
client realserver realserver l 0 0 17 lo
realserver realserver realserver l 0 0 2 lo
realserver director director 0 1 0 eth0
>realserver client director 0 0 35 eth0<
director realserver realserver l 0 0 16 lo
director realserver realserver l 0 0 63 lo
shows that the only route to the client (labelled with ><) is via the director.
> For send_redirects, what's the difference between all, default and eth0?
from Julian
see the LVS archives
When the kernel needs to check for one feature (send_redirects for example) it uses calls like:

if (IN_DEV_TX_REDIRECTS(in_dev)) ...

These macros are defined in /usr/src/linux/include/linux/inetdevice.h. The macro returns a value using an expression built from all/<var> and <dev>/<var>. So, these macros check, for example, for:

all/send_redirects || eth0/send_redirects

or

all/hidden && eth0/hidden

When you create eth0 for the first time using ifconfig eth0 ... up, default/send_redirects is copied to eth0/send_redirects by the kernel, internally. I.e. default/ contains the initial values a device inherits when it is created. This is the safest way for a device to appear with correct conf/<dev>/ values.

When we put a value in all/<var> you can assume that we set the <var> for all devices in this way:

all/<var>   the macro returns:
for &&  0   0
for &&  1   the value from <dev>/<var>
for ||  0   the value from <dev>/<var>
for ||  1   1

This scheme allows the different devices to have different values for their vars. For example, if we set 0 to all/send_redirects, the 3rd line applies, i.e. the result from the macro is the real value in <dev>/send_redirects. If we set 1 to all/send_redirects, then according to the 4th line the macro always returns 1 regardless of <dev>/send_redirects.
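To check the three send_redirects values on the director (a sketch; eth0 may be a different device on your machine):

director:# cat /proc/sys/net/ipv4/conf/all/send_redirects
director:# cat /proc/sys/net/ipv4/conf/default/send_redirects
director:# cat /proc/sys/net/ipv4/conf/eth0/send_redirects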
> ...how to debug/understand TCP/IP packets?
The RFC documents are your friends:
http://www.ietf.cnri.reston.va.us/rfc.html
The numbers you need:
 793 TRANSMISSION CONTROL PROTOCOL
1122 Requirements for Internet Hosts -- Communication Layers
1812 Requirements for IP Version 4 Routers
 826 An Ethernet Address Resolution Protocol
for tcpdump, see man tcpdump.
From Steve.Gonczi@networkengines.com: for Microsoft NT _server_, there is a uSoft-supplied packet capture utility as well.
Also - W. Richard Stevens: TCP/IP Illustrated, Vol 1, a good intro to packet layouts and protocol basics. (anything by Stevens is good - Joe).
From: Ivan Figueredo idf@weewannabe.com
for windump - http://netgroup-serv.polito.it/windump/
Here's an untested solution from Julian for a one network VS-NAT
>put the client in the external logical network. By this way the
>client, the director and the real server(s) are on same physical network
>but the client can't be on the masqueraded logical network. So, change the
>client from 192.168.1.80 to 166.84.192.80 (or something else). Don't add
>interface 192.168.1.0/24 in the client. The path to 192.168.1.0/24 must be
>through DIP (I don't see such IP for the Director). Why in your setup
>DIP==VIP ? If you add DIP (166.84.192.33 for example) in the director you
>can later add path for 192.168.1.0/24 through 166.84.192.33. There is no
>need to use masquerading with 2 NICs. Just remove the client from the
>internal logical network used by the LVS cluster.
A different working solution from Ray Bellis rpb@community.net.uk
>On my system the VIP and RIP aren't on the same *physical* network, just on
>the same *logical* subnet.
>I still have a dual-ethernet box acting as a director, and the VIP is
>installed as an alias interface on the external side of the director, even
>though the IP address it has is in fact assigned from the same subnet as the
>machines on the internal side of the director.
Ray Bellis rpb@community.net.uk has used a 2 NIC director to have the RIPs on the same logical network as the VIP (ie RIP and VIP numbers are from the same subnet), although they are in different physical networks.
frederic.defferrard@ansf.alcatel.fr asked whether it would be possible to use LVS-NAT to load-balance virtual-IPs to ssh-forwarded real-IPs.
Ssh can also be used to create a local access point that is forwarded to a remote service through the ssh protocol. For example you can use ssh to securely map a local port to a remote POP server:
local:localport ==> local:ssh --- (ssh port forwarding) --- remote:ssh ==> remote:pop
And when you connect to local:localport you are transparently/securely connected to remote:pop.
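A minimal sketch of such a forward (the hostnames, the user and the local port 1110 are illustrative):

#map local port 1110 to the pop-3 port (110) on the remote machine
local:~# ssh -f -N -L 1110:localhost:110 user@remote
#a pop client pointed at local:1110 now reaches remote:pop through the ssh tunnel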
The main idea is to allow real-servers in different LANs, including real-servers that are non-Linux (precluding VS-Tun).
Example:

                                - VS:81 ---- ssh ---- RS:80
                               /
INTERNET - - - - > VS:80 (NAT)-- VS:82 ---- ssh ---- RS:80
                               \
                                - VS:83 ---- ssh ---- RS:80
(Wensong) you can use VPN (or CIPE) to map some external real servers into your private cluster network. If you use LVS-NAT, make sure the routing on the real server is configured properly so that the response packets go through the load balancer to the clients.
> I think that it isn't necessary to have the default route to the load
> balancer when using ssh, because the RS address is the same as the
> VS address (different ports)
>
> > > Wensong
> >
> > > Example:
> > >                                - VS:81 ---- ssh ---- RS:80
> > >                               /
> > > INTERNET - - - - > VS:80 (NAT)-- VS:82 ---- ssh ---- RS:80
> > >                               \
> > >                                - VS:83 ---- ssh ---- RS:80
> > >
> > > The main idea is to allow RS in different LANs.
> > >
With the NAT method, your example won't work because the LVS/NAT code treats the packets as local ones and forwards them to the upper layers without any change.
However, your example gives me an idea: we could dynamically redirect port 80 to ports 81, 82 and 83 respectively for different connections, and then your example would work. However, the performance won't be good, because a lot of work is done at the application level, and the overhead of copying between kernel and user-space is high.
Another thought is that we might be able to set up LVS/DR with real-servers in different LANs by using CIPE/VPN. For example, we use CIPE to establish tunnels from the load balancer to the real-servers like
                   10.0.0.1================10.0.1.1 realserver1
                   10.0.0.2================10.0.1.2 realserver2
Load Balancer ---  10.0.0.3================10.0.1.3 realserver3
                   10.0.0.4================10.0.1.4 realserver4
                   10.0.0.5================10.0.1.5 realserver5
Then, you can add LVS/DR configuration commands as:
ipvsadm -A -t VIP:www
ipvsadm -a -t VIP:www -r 10.0.1.1 -g
ipvsadm -a -t VIP:www -r 10.0.1.2 -g
ipvsadm -a -t VIP:www -r 10.0.1.3 -g
ipvsadm -a -t VIP:www -r 10.0.1.4 -g
ipvsadm -a -t VIP:www -r 10.0.1.5 -g
I haven't tested it. Please let me know the result if anyone tests this configuration.