ipvs at runtime
How ipvs rules work
How do ipvs rules take effect? Let's start with the principle behind the implementation.
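Put simply, ipvs does nothing more than rewrite packet header information in order to schedule traffic along the path client -> virtual server -> real server. The goal of scheduling is to keep the load across the real servers close to balanced. Two questions follow from this: how packets are rewritten, and which scheduling policy is used.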
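Let's look first at how packets are rewritten. In the 2.6 kernel, ipvs is implemented somewhat differently from earlier versions. To quote the author of ipvs, Wensong Zhang:
"We modified the TCP/IP stack in Linux kernels 2.0 and 2.2 to intercept and rewrite/forward IP packets at the IP layer, implemented three IP load-balancing techniques, and provided the ipvsadm program for configuring and managing virtual servers. In Linux kernels 2.4 and 2.6 we implemented it as a NetFilter module; much of the code was rewritten and further optimized. The current version has been released online and, according to the feedback received, is already fairly stable."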
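That makes it clear: ipvs relies on netfilter to rewrite packets. A brief look at how netfilter works is therefore worthwhile (see the figure).
netfilter has 5 rule chains. Each chain can hold a number of rules, and the rules are ordered (that is, they have priorities). As soon as a rule matches and its action has been carried out, processing leaves that chain. The 5 chains are PREROUTING, INPUT, FORWARD, OUTPUT and POSTROUTING. Connections on a machine fall into 3 kinds: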
Connections entering the host from outside pass through PREROUTING -> INPUT
Connections leaving the host pass through OUTPUT -> POSTROUTING
Connections forwarded by the host pass through PREROUTING -> FORWARD -> POSTROUTING
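The three paths above can be written down as a small lookup, which makes it easy to see which chains a given kind of traffic will touch. This is only an illustrative userspace sketch: the `traffic` enum, the `chain_path()` function and all names in it are made up for this example and are not kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* The five chains, in the order the text lists them. */
enum chain { PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING };

/* The three kinds of traffic described above (hypothetical names). */
enum traffic { TRAFFIC_INBOUND, TRAFFIC_OUTBOUND, TRAFFIC_FORWARDED };

/*
 * Fill `path` with the chains a packet of the given kind traverses,
 * in order, and return how many entries were written (at most 3).
 */
size_t chain_path(enum traffic kind, enum chain path[3])
{
    assert(path != NULL);

    switch (kind) {
    case TRAFFIC_INBOUND:
        path[0] = PREROUTING;
        path[1] = INPUT;
        return 2;
    case TRAFFIC_OUTBOUND:
        path[0] = OUTPUT;
        path[1] = POSTROUTING;
        return 2;
    case TRAFFIC_FORWARDED:
        path[0] = PREROUTING;
        path[1] = FORWARD;
        path[2] = POSTROUTING;
        return 3;
    }
    return 0; /* unknown kind: no chains */
}
```

Read against the rest of this article: requests addressed to the virtual server are inbound traffic and reach INPUT, while VS/NAT replies from the real servers are forwarded traffic and reach FORWARD and POSTROUTING.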
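The rules in a chain take effect when data passes through that chain (that is, the corresponding function is called to process it). It looks simple enough: ipvs, as a netfilter module, only has to write its rules into these chains.
But wait. If netfilter has many modules and they all write rules into the same chain, won't things get messy? How is priority controlled? For this reason the rules in a chain are grouped by purpose, and each group carries an integer priority: the smaller the number, the higher the priority. Rules of the same group are ordered by when they were added (the chain is a linked list, and entries nearer the front have higher priority).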
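netfilter itself serves 3 purposes, so its rules come in 3 types, represented by 3 tables: the filter table (filtering), the nat table (rewriting packet headers) and the mangle table (altering packets). The ipvs module behaves as if it added a new ipvs table to netfilter. For more on netfilter, see Reference 1.

How ipvs rules are implemented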
Whenever a new connection (packet) passes through a netfilter chain, NF_HOOK() is called. It consults a global variable, nf_hooks, which holds all of netfilter's tables (filter, nat, mangle, the additional ipvs one, and so on) together with each table's chains and the functions registered on them. NF_HOOK() walks the entries for the relevant chain in nf_hooks and calls the registered functions in priority order; each function then matches and processes the packet according to the rules of that chain.
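The priority-ordered traversal described above can be sketched in a few lines of userspace C. This is a simplified model under stated assumptions, not the kernel's code: `struct hook`, `register_hook()`, `run_chain()`, the `V_*` verdicts and the two sample hooks are all hypothetical names, and a single linked list stands in for one chain of nf_hooks.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for netfilter verdicts (hypothetical names). */
enum verdict { V_DROP, V_ACCEPT };

typedef enum verdict (*hook_fn)(int *pkt);

struct hook {
    hook_fn fn;
    int priority;        /* smaller value runs earlier */
    struct hook *next;
};

/*
 * Insert `h` so the list stays sorted by ascending priority.
 * Hooks with equal priority keep their registration order.
 */
void register_hook(struct hook **head, struct hook *h)
{
    assert(head != NULL && h != NULL);

    while (*head && (*head)->priority <= h->priority)
        head = &(*head)->next;
    h->next = *head;
    *head = h;
}

/* Call every hook in order; any verdict other than V_ACCEPT ends the walk. */
enum verdict run_chain(struct hook *head, int *pkt)
{
    for (struct hook *h = head; h; h = h->next) {
        enum verdict v = h->fn(pkt);
        if (v != V_ACCEPT)
            return v;
    }
    return V_ACCEPT;
}

/* Two sample hooks that append a digit to *pkt to record call order. */
enum verdict hook_a(int *pkt) { *pkt = *pkt * 10 + 1; return V_ACCEPT; }
enum verdict hook_b(int *pkt) { *pkt = *pkt * 10 + 2; return V_ACCEPT; }
```

Registering `hook_a` with priority 100 and then `hook_b` with priority 99 makes `run_chain()` call `hook_b` first, even though it was registered second. The same ordering rule decides which of two functions on one chain sees a packet first.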
Remember nf_register_hook() from the end of the previous part? It is because ipvs calls ret = nf_register_hooks(ip_vs_ops, ARRAY_SIZE(ip_vs_ops)); to add entries to nf_hooks that netfilter ends up executing the ipvs rules. Let's see what exactly gets added.
The contents of ip_vs_ops are:
net/ipv4/ipvs/ip_vs_core.c
static struct nf_hook_ops ip_vs_ops[] __read_mostly = {
    /* After packet filtering, forward packet through VS/DR, VS/TUN,
     * or VS/NAT(change destination), so that filtering rules can be
     * applied to IPVS. */
    {
        .hook     = ip_vs_in,            // handler: any packet traversing the INPUT chain is passed to ip_vs_in() for matching and processing
        .owner    = THIS_MODULE,         // owning module
        .pf       = PF_INET,             // protocol family, here IPv4 (PF_INET)
        .hooknum  = NF_INET_LOCAL_IN,    // hook point: the INPUT chain
        .priority = 100,                 // priority
    },
    /* After packet filtering, change source only for VS/NAT */
    {
        .hook     = ip_vs_out,           // packets traversing FORWARD are processed by ip_vs_out()
        .owner    = THIS_MODULE,
        .pf       = PF_INET,
        .hooknum  = NF_INET_FORWARD,
        .priority = 100,
    },
    /* After packet filtering (but before ip_vs_out_icmp), catch icmp
     * destined for 0.0.0.0/0, which is for incoming IPVS connections */
    {
        .hook     = ip_vs_forward_icmp,  // packets traversing FORWARD are processed by ip_vs_forward_icmp()
        .owner    = THIS_MODULE,
        .pf       = PF_INET,
        .hooknum  = NF_INET_FORWARD,
        .priority = 99,
    },
    /* Before the netfilter connection tracking, exit from POST_ROUTING */
    {
        .hook     = ip_vs_post_routing,  // packets traversing POSTROUTING are processed by ip_vs_post_routing()
        .owner    = THIS_MODULE,
        .pf       = PF_INET,
        .hooknum  = NF_INET_POST_ROUTING,
        .priority = NF_IP_PRI_NAT_SRC-1,
    },
};
As you can see, ipvs registers 4 handler functions in total across 3 chains: INPUT, FORWARD and POSTROUTING. Let's go through them one by one.
ip_vs_in()
ip_vs_out()
ip_vs_forward_icmp()
ip_vs_post_routing()

ip_vs_in()
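ip_vs_in() sits on the INPUT chain and inspects every packet entering this host. Its job is to hand connections addressed to the vs (virtual server) over to an rs (real server), which is how load balancing is achieved. How a real server is chosen depends on the scheduling algorithm selected at configuration time. How the packet is rewritten depends on the VS type, of which there are 3: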
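VS/NAT rewrites d_addr (and possibly d_port) on requests, and s_addr on replies
VS/DR leaves the IP header untouched and only redirects the frame to the real server's MAC address
VS/TUN prepends a new IP header to the original packet, also known as encapsulation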
This function processes all data whose destination is this host (the director). From the skb (sk_buff) it obtains the protocol structure pp (ip_vs_protocol), works out which skbs match a virtual service svc (ip_vs_service), and looks up the corresponding cp (ip_vs_conn). If none is found, a new cp is created and added to ip_vs_conn_tab. Finally the data is sent on through cp->packet_xmit(). Along the way a number of things have to be updated, such as the connection state and the counters of pp, cp and skb.
net/ipv4/ipvs/ip_vs_core.c
/*
 * Check if it's for virtual services, look it up,
 * and send it on its way...
 */
// In other words: check whether the packet is addressed to a vs (virtual server) and, if so, forward it to where it should go.
static unsigned int
ip_vs_in(unsigned int hooknum, struct sk_buff *skb,
         const struct net_device *in, const struct net_device *out,
         int (*okfn)(struct sk_buff *))
// hooknum: hook (chain) number; skb: socket buffer holding the packet;
// in: device the packet arrived on; out: device it will leave through (if known);
// okfn: callback that takes the sk_buff, essentially unused here
{
    struct iphdr *iph;
    struct ip_vs_protocol *pp;
    struct ip_vs_conn *cp;
    int ret, restart;
    int ihl;

    /*
     * Big tappo: only PACKET_HOST (neither loopback nor mcasts)
     * ... don't know why 1st test DOES NOT include 2nd (?)
     */
    if (unlikely(skb->pkt_type != PACKET_HOST        // not addressed to this host (PACKET_HOST),
                 || skb->dev->flags & IFF_LOOPBACK   // or arriving on the loopback device,
                 || skb->sk)) {                      // or already owned by a local socket (presumably an existing real connection on this host)
        IP_VS_DBG(12, "packet type=%d proto=%d daddr=%d.%d.%d.%d ignored\n",
                  skb->pkt_type,
                  ip_hdr(skb)->protocol,
                  NIPQUAD(ip_hdr(skb)->daddr));      // log it through IP_VS_DBG
        return NF_ACCEPT;                            // return NF_ACCEPT at once (continue with the next hook function)
    }
    // On a director these cases are rare, hence unlikely(), a GCC branch-prediction hint that keeps the common path fast.

    iph = ip_hdr(skb);    // IP header of the packet
    if (unlikely(iph->protocol == IPPROTO_ICMP)) {    // ICMP packet
        int related, verdict = ip_vs_in_icmp(skb, &related, hooknum);    // let ip_vs_in_icmp() handle it

        if (related)           // it relates to a known connection
            return verdict;    // return whatever ip_vs_in_icmp() decided
        iph = ip_hdr(skb);     // otherwise reload the network-layer header pointer (ip_hdr() computes it from an offset into the skb)
    }

    /* Protocol supported? */
    pp = ip_vs_proto_get(iph->protocol);    // protocols ipvs does not know are passed through
    if (unlikely(!pp))
        return NF_ACCEPT;

    ihl = iph->ihl << 2;    // iph->ihl counts 4-byte words, so convert it to bytes

    /*
     * Check if the packet belongs to an existing connection entry
     */
    cp = pp->conn_in_get(skb, pp, iph, ihl, 0);    // does the connection already exist? cp is its connection entry

    if (unlikely(!cp)) {    // not found in ip_vs_conn_tab (first packet of this connection to the vs)
        int v;

        // Use the protocol's conn_schedule function to pick a suitable rs for the skb,
        // build a new cp from skb and pp, and add it to ip_vs_conn_tab. For how the rs
        // is chosen, see the protocol's conn_schedule function, e.g. tcp_conn_schedule().
        if (!pp->conn_schedule(skb, pp, &v, &cp))
            return v;    // scheduling produced a final verdict: return it
    }

    if (unlikely(!cp)) {    // still no connection entry: log and let the packet continue
        /* sorry, all this trouble for a no-hit :) */
        IP_VS_DBG_PKT(12, pp, skb, 0,
                      "packet continues traversal as normal");
        return NF_ACCEPT;
    }

    IP_VS_DBG_PKT(11, pp, skb, 0, "Incoming packet");

    /* Check the server status */
    if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {    // the destination is unavailable
        /* the destination server is not available */

        if (sysctl_ip_vs_expire_nodest_conn) {    // expire cp immediately
            /* try to expire the connection immediately */
            ip_vs_conn_expire_now(cp);
        }
        /* don't restart its timer, and silently
           drop the packet. */
        __ip_vs_conn_put(cp);    // drop one reference on cp
        return NF_DROP;
    }

    ip_vs_in_stats(cp, skb);    // update the counters (connections and traffic volume)
    restart = ip_vs_set_state(cp, IP_VS_DIR_INPUT, skb, pp);    // update the connection state for the IP_VS_DIR_INPUT direction
    // Send the data through cp->packet_xmit(). The function was chosen by ip_vs_bind_xmit(cp)
    // when cp was created, based on dest->flags (the real server's flags). There are 5 variants:
    // ip_vs_nat_xmit, ip_vs_tunnel_xmit, ip_vs_dr_xmit, ip_vs_null_xmit, ip_vs_bypass_xmit.
    if (cp->packet_xmit)
        ret = cp->packet_xmit(skb, cp, pp);
        /* do not touch skb anymore */
    else {
        IP_VS_DBG_RL("warning: packet_xmit is null");
        ret = NF_ACCEPT;
    }

    /* Increase its packet counter and check if it is needed
     * to be synchronized
     *
     * Sync connection if it is about to close to
     * encorage the standby servers to update the connections timeout
     */
    atomic_inc(&cp->in_pkts);    // packet counter
    if ((ip_vs_sync_state & IP_VS_STATE_MASTER) &&
        (((cp->protocol != IPPROTO_TCP ||
           cp->state == IP_VS_TCP_S_ESTABLISHED) &&
          (atomic_read(&cp->in_pkts) % sysctl_ip_vs_sync_threshold[1]
           == sysctl_ip_vs_sync_threshold[0])) ||
         ((cp->protocol == IPPROTO_TCP) && (cp->old_state != cp->state) &&
          ((cp->state == IP_VS_TCP_S_FIN_WAIT) ||
           (cp->state == IP_VS_TCP_S_CLOSE)))))
        ip_vs_sync_conn(cp);    // append the ip_vs_conn information to sync_buff, used to synchronize state between directors
    cp->old_state = cp->state;

    ip_vs_conn_put(cp);    // release the reference on cp
    return ret;
}
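The synchronization condition at the end of ip_vs_in() is dense, so here it is restated as a standalone predicate. This is a hedged userspace sketch: the enums, `should_sync()` and its parameters `first` and `period` (standing in for sysctl_ip_vs_sync_threshold[0] and [1]) are hypothetical names, and the check for master state is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified protocol and connection-state values (hypothetical names). */
enum proto { PROTO_TCP, PROTO_UDP };
enum state { S_ESTABLISHED, S_FIN_WAIT, S_CLOSE, S_OTHER };

/*
 * Decide whether a connection should be pushed to the backup director.
 * `first` and `period` play the role of the two sync threshold values.
 */
bool should_sync(enum proto p, enum state old_state, enum state state,
                 unsigned in_pkts, unsigned first, unsigned period)
{
    assert(period > 0);

    /* Periodic sync: non-TCP, or established TCP, on every period-th
     * packet offset by `first`. */
    bool periodic = (p != PROTO_TCP || state == S_ESTABLISHED) &&
                    in_pkts % period == first;

    /* Closing sync: a TCP connection that has just changed state to
     * FIN_WAIT or CLOSE. */
    bool closing = p == PROTO_TCP && old_state != state &&
                   (state == S_FIN_WAIT || state == S_CLOSE);

    return periodic || closing;
}
```

Assuming threshold values of 3 and 50 for the sake of the example, an established TCP connection is synchronized on packets 3, 53, 103 and so on, plus once more at the moment it moves to FIN_WAIT or CLOSE.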
ip_vs_out()
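This function sits on the FORWARD chain, so every skb forwarded by this host passes through it. In VS/NAT mode, the data that an rs on the internal network returns to the client is forwarded by the gateway (this host). At that point the source address of the packet has to be rewritten to the gateway's public IP address, the address the client originally connected to. Only then can the connection carry on; otherwise the client would receive replies from an internal rs address that it never contacted and cannot reach.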
net/ipv4/ipvs/ip_vs_core.c
/*
 * It is hooked at the NF_INET_FORWARD chain, used only for VS/NAT.
 * Check if outgoing packet belongs to the established ip_vs_conn,
 * rewrite addresses of the packet and send it on its way...
 */
static unsigned int
ip_vs_out(unsigned int hooknum, struct sk_buff *skb,
          const struct net_device *in, const struct net_device *out,
          int (*okfn)(struct sk_buff *))
{
    struct iphdr *iph;
    struct ip_vs_protocol *pp;
    struct ip_vs_conn *cp;
    int ihl;

    EnterFunction(11);    // debug

    if (skb->ipvs_property)    // already handled by ipvs: let it pass
        return NF_ACCEPT;

    iph = ip_hdr(skb);    // pointer to the start of the network-layer header
    if (unlikely(iph->protocol == IPPROTO_ICMP)) {    // ICMP packet
        int related, verdict = ip_vs_out_icmp(skb, &related);    // let ip_vs_out_icmp() handle it

        if (related)           // it relates to a known connection
            return verdict;    // return the verdict
        iph = ip_hdr(skb);     // otherwise reload iph, presumably because the helper may have changed the skb data and left the old pointer stale
    }

    pp = ip_vs_proto_get(iph->protocol);    // look up the ip_vs_protocol structure pp
    if (unlikely(!pp))                      // protocols ipvs does not support are passed through
        return NF_ACCEPT;

    /* reassemble IP fragments */
    if (unlikely(iph->frag_off & htons(IP_MF|IP_OFFSET) &&    // the skb is a fragment
                 !pp->dont_defrag)) {
        if (ip_vs_gather_frags(skb, IP_DEFRAG_VS_OUT))    // fragment queued, datagram not complete yet: the skb now belongs to the reassembly code,
            return NF_STOLEN;                             // so return NF_STOLEN and netfilter will not touch it again

        iph = ip_hdr(skb);    // reassembly finished: reload iph for the complete datagram
    }

    ihl = iph->ihl << 2;    // header length in bytes (ihl counts 4-byte words)

    /*
     * Check if the packet belongs to an existing entry
     */
    // Is this skb the reply direction (rs -> client) of a connection (client -> rs)
    // recorded in ip_vs_conn_tab? If so cp (ip_vs_conn) is returned, otherwise NULL.
    cp = pp->conn_out_get(skb, pp, iph, ihl, 0);

    if (unlikely(!cp)) {    // no such connection
        if (sysctl_ip_vs_nat_icmp_send &&    // 0 by default, so this branch is normally skipped (it looks like it is mainly useful for debugging)
            (pp->protocol == IPPROTO_TCP ||  // the skb is TCP or UDP
             pp->protocol == IPPROTO_UDP)) {
            __be16 _ports[2], *pptr;

            pptr = skb_header_pointer(skb, ihl,    // read the port numbers from the skb
                                      sizeof(_ports), _ports);
            if (pptr == NULL)    // no ports available: pass
                return NF_ACCEPT;    /* Not for me */
            // Use protocol / source address / source port to check whether the
            // TCP/UDP packet was sent by one of the internal real servers.
            if (ip_vs_lookup_real_service(iph->protocol,
                                          iph->saddr, pptr[0])) {
                /*
                 * Notify the real server: there is no
                 * existing entry if it is not RST
                 * packet or not TCP packet.
                 */
                // The packet comes from an rs but belongs to no known connection. Unless it is
                // a TCP RST, answer with an ICMP error (destination unreachable, port unreachable)
                // and drop the skb.
                if (iph->protocol != IPPROTO_TCP
                    || !is_tcp_reset(skb)) {
                    icmp_send(skb, ICMP_DEST_UNREACH,
                              ICMP_PORT_UNREACH, 0);
                    return NF_DROP;
                }
            }
        }
        IP_VS_DBG_PKT(12, pp, skb, 0,
                      "packet continues traversal as normal");
        return NF_ACCEPT;    // pass new connections opened from the internal network (real servers) to the outside; they relate to nothing in ip_vs_conn_tab
    }

    IP_VS_DBG_PKT(11, pp, skb, 0, "Outgoing packet");    // debug

    if (!skb_make_writable(skb, ihl))    // the header cannot be made writable: go to drop
        goto drop;

    /* mangle the packet */
    // Anything that gets here is an rs -> client packet going from the internal
    // network to the outside whose source address must be rewritten.
    if (pp->snat_handler && !pp->snat_handler(skb, pp, cp))
        goto drop;    // a snat_handler is defined but it failed: go to drop
    ip_hdr(skb)->saddr = cp->vaddr;    // set the source address to the virtual server address, so the packet appears to come from the vs
    ip_send_check(ip_hdr(skb));        // the source address changed, so the header checksum must be recomputed

    /* For policy routing, packets originating from this
     * machine itself may be routed differently to packets
     * passing through. We want this packet to be routed as
     * if it came from this machine itself. So re-compute
     * the routing information.
     */
    if (ip_route_me_harder(skb, RTN_LOCAL) != 0)    // refresh the routing information so the skb is routed as if sent by this host
        goto drop;

    IP_VS_DBG_PKT(10, pp, skb, 0, "After SNAT");    // debug

    ip_vs_out_stats(cp, skb);    // update the counters (connections, traffic)
    ip_vs_set_state(cp, IP_VS_DIR_OUTPUT, skb, pp);    // update the connection state for the output direction
    ip_vs_conn_put(cp);          // release the reference on cp

    skb->ipvs_property = 1;      // mark the skb so ipvs does not modify it again

    LeaveFunction(11);    // debug
    return NF_ACCEPT;     // pass

  drop:
    ip_vs_conn_put(cp);    // release the reference on cp
    kfree_skb(skb);        // free the skb
    return NF_STOLEN;      // return NF_STOLEN so netfilter does not touch the freed skb
}
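The two lines at the heart of ip_vs_out(), overwriting saddr and then calling ip_send_check(), can be imitated in userspace on a raw IPv4 header. This is a minimal sketch under stated assumptions: `ip_checksum()` and `snat_rewrite()` are hypothetical helpers written for this example, the header is treated as a plain byte array, and the transport-layer checksum that a protocol's snat_handler would have to fix is not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071 style) over `len` bytes; len must be even. */
uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    assert(hdr != NULL && len % 2 == 0);
    for (size_t i = 0; i < len; i += 2)
        sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
    while (sum >> 16)                        /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/*
 * Replace the source address (header bytes 12..15) with `vaddr`
 * and recompute the header checksum (header bytes 10..11).
 */
void snat_rewrite(uint8_t *hdr, const uint8_t vaddr[4])
{
    assert(hdr != NULL && vaddr != NULL);

    /* The low nibble of byte 0 is ihl in 4-byte words; << 2 gives bytes. */
    size_t ihl = (size_t)(hdr[0] & 0x0f) << 2;
    assert(ihl >= 20);

    for (int i = 0; i < 4; i++)
        hdr[12 + i] = vaddr[i];

    hdr[10] = 0;                             /* checksum field is zero while summing */
    hdr[11] = 0;
    uint16_t csum = ip_checksum(hdr, ihl);
    hdr[10] = (uint8_t)(csum >> 8);
    hdr[11] = (uint8_t)(csum & 0xff);
}
```

A quick sanity check for any rewritten header: running `ip_checksum()` over the full header, checksum field included, should give 0. If it does not, the receiver would discard the packet.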
ip_vs_forward_icmp()
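This function is on the same FORWARD chain as ip_vs_out(), but its priority is 99, numerically smaller (and therefore higher) than ip_vs_out()'s 100, so it runs first.
The function is very simple: it hands every ICMP packet that passes through the FORWARD chain to ip_vs_in_icmp(). Why would data meant for this host end up on the FORWARD chain? The reason is that when a fwmark-based virtual service is used, for example a transparent cache cluster, TCP/UDP packets can be marked and routed to INPUT fairly easily, but ICMP is not so simple. Some ICMP packets meant for the director therefore arrive on the FORWARD chain instead (I am not sure of the exact mechanism). ip_vs_forward_icmp() catches these stray incoming ICMP packets there and passes them to ip_vs_in_icmp().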
net/ipv4/ipvs/ip_vs_core.c
/*
 * It is hooked at the NF_INET_FORWARD chain, in order to catch ICMP
 * related packets destined for 0.0.0.0/0.
 * When fwmark-based virtual service is used, such as transparent
 * cache cluster, TCP packets can be marked and routed to ip_vs_in,
 * but ICMP destined for 0.0.0.0/0 cannot not be easily marked and
 * sent to ip_vs_in_icmp. So, catch them at the NF_INET_FORWARD chain
 * and send them to ip_vs_in_icmp.
 */
static unsigned int
ip_vs_forward_icmp(unsigned int hooknum, struct sk_buff *skb,
                   const struct net_device *in, const struct net_device *out,
                   int (*okfn)(struct sk_buff *))
{
    int r;

    if (ip_hdr(skb)->protocol != IPPROTO_ICMP)    // not ICMP: pass straight through
        return NF_ACCEPT;

    return ip_vs_in_icmp(skb, &r, hooknum);    // ICMP: process it
}
ip_vs_post_routing()
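This function's priority is NF_IP_PRI_NAT_SRC-1, which places it immediately ahead of source NAT on the POSTROUTING chain. Running before NAT ensures that a packet already rewritten by ipvs is not rewritten again by netfilter. Here is how it is done: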
net/ipv4/ipvs/ip_vs_core.c
/*
 * It is hooked before NF_IP_PRI_NAT_SRC at the NF_INET_POST_ROUTING
 * chain, and is used for VS/NAT.
 * It detects packets for VS/NAT connections and sends the packets
 * immediately. This can avoid that iptable_nat mangles the packets
 * for VS/NAT.
 */
static unsigned int
ip_vs_post_routing(unsigned int hooknum, struct sk_buff *skb,
                   const struct net_device *in, const struct net_device *out,
                   int (*okfn)(struct sk_buff *))
{
    if (!skb->ipvs_property)    // no ipvs mark on the skb: pass, and let netfilter carry on
        return NF_ACCEPT;
    /* The packet was sent from IPVS, exit this chain */
    return NF_STOP;    // otherwise return NF_STOP; on receiving it netfilter leaves the chain at once and does no further processing
}
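The effect of the ipvs_property mark can be shown with a toy model of the POSTROUTING chain. Everything here is hypothetical and simplified for illustration: `struct pkt`, the `V_*` verdicts, `post_routing_guard()`, `source_nat()` and `run_postrouting()` are made-up names, and the chain holds just two hooks in fixed order.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for NF_ACCEPT and NF_STOP (hypothetical names). */
enum verdict { V_ACCEPT, V_STOP };

struct pkt {
    int ipvs_property;   /* set once ipvs has rewritten the packet */
    int nat_touched;     /* set if the later NAT hook ran */
};

/* Plays the role of ip_vs_post_routing(): stop the chain for marked packets. */
enum verdict post_routing_guard(struct pkt *p)
{
    assert(p != NULL);
    return p->ipvs_property ? V_STOP : V_ACCEPT;
}

/* Plays the role of a later source-NAT hook on the same chain. */
enum verdict source_nat(struct pkt *p)
{
    assert(p != NULL);
    p->nat_touched = 1;
    return V_ACCEPT;
}

/* The guard runs first; the NAT hook only runs if the guard accepts. */
enum verdict run_postrouting(struct pkt *p)
{
    if (post_routing_guard(p) == V_STOP)
        return V_STOP;
    return source_nat(p);
}
```

A packet with the mark set leaves the chain with `nat_touched` still 0, while an unmarked packet goes on to the NAT hook as usual.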