Skip to content

Latest commit

 

History

History
555 lines (395 loc) · 22.4 KB

README.adoc

File metadata and controls

555 lines (395 loc) · 22.4 KB

JGroups and Docker

Overview

This document describes how to use JGroups in Docker containers to form clusters locally, on Amazon Web Services (AWS) and in the Google cloud (Google Compute Platform, GCP).

The advantage of running Docker images in the cloud, rather than cloud-specific images is that the Docker images are the same across different clouds, whereas cloud images are always specific to a given cloud.

The predefined docker image we’ll use is https://hub.docker.com/r/belaban/jgroups. It contains a number of interactive Demos, ie. demos which require a TTY.

The difference between running locally, or running in a cloud lies in the JGroups configuration and the Docker startup options. Clouds generally do not support IP multicasting, so JGroups applications have to resort to TCP rather than UDP as transport. Plus, PING cannot be used and has to be replaced with different discovery protocols, e.g. S3_PING or GOOGLE_PING.

Bridged and host network

Docker supports none, bridge and host networks out of the box (https://docs.docker.com/engine/userguide/networking). Which type of network is used is defined by the --network option, e.g. --network=host.

Network none has no communication to the outside world and is used purely for Docker containers running on the same box. The bind address used is usually 127.0.0.1 (loopback) here.

Bridge network

Network bridge is the default and creates a virtual interface in the container that’s not available on the host. To be able to communicate with other Docker containers on other hosts, the host’s IP address has to be used (e.g. using external_addr=<host’s address>, or by switching to --network=host (see below).

For instance, starting an AWS EC2 instance, ifconfig executed on the host shows the following interfaces:

[ec2-user@ip-172-31-14-130 ~]$ ifconfig
docker0   Link encap:Ethernet  HWaddr 02:42:33:C7:8E:F1
          inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:33ff:fec7:8ef1/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:22 errors:0 dropped:0 overruns:0 frame:0
          TX packets:20 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1342 (1.3 KiB)  TX bytes:1492 (1.4 KiB)

eth0      Link encap:Ethernet  HWaddr 0E:C8:A1:0F:39:E6
          inet addr:172.31.14.130  Bcast:172.31.15.255  Mask:255.255.240.0
          inet6 addr: fe80::cc8:a1ff:fe0f:39e6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:105050 errors:0 dropped:0 overruns:0 frame:0
          TX packets:37021 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:117948327 (112.4 MiB)  TX bytes:6929186 (6.6 MiB)
...

The docker0 interface is created by Docker to implement the bridged network. This is the interface that will be used by the Docker container. Note that eth0 is the host’s interface, with a routable 172.31.14.130 IP address.

When starting a container with bridged networking (default):

docker run -it -p 7800:7800 --rm belaban/jgroups, ifconfig executed in the container shows:

bash-4.3$ ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:AC:11:00:02
          inet addr:172.17.0.2  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:acff:fe11:2%32571/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:6 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:508 (508.0 B)  TX bytes:258 (258.0 B)
...

The 172.17.0.2 address is assigned by Docker from the docker0 interface.

The issue with these Docker-private addresses is that they cannot be used to talk between AWS instances, as they are not routed. For instances to talk to each other, the host’s eth0 has to be used (from the 172.31.0.0 subnet).

To do this, JGroups has an attribute external_addr in the transport. In a configuration, the following transport snippet would enable JGroups applications in network-bridged Docker containers to communicate:

<TCP external_addr="${ext-addr:172.31.14.130}"
     bind_addr="match-interface:eth0"
  ...
 />

This means that JGroups will bind to eth0 (172.17.0.2), but advertize its address as 172.31.14.130, so other instances can talk to the instance.

How to find out the hosts address is implementation-dependent; in AWS curl http://169.254.169.254/latest/meta-data/local-ipv4 returns the IP address of the EC2 instance.

An alternative is to grab the IP address before starting the Docker container and pass it to the container as an environment variable (e.g. -e EXT-ADDR=1.2.3.4). The container’s entrypoint could then start JGroups by passing system property -Dext-addr=1.2.3.4 which will set external_addr in TCP.

Host network

The host network is started as follows:

docker run --network=host -it -p 7800:7800 --rm belaban/jgroups.

This means that the started container has access to the same interfaces as present on the host. This can be confirmed by executing ifconfig inside the container:

bash-4.3$ ifconfig
docker0   Link encap:Ethernet  HWaddr 02:42:14:CB:41:B6
          inet addr:172.17.0.1  Bcast:172.17.255.255  Mask:255.255.0.0
          inet6 addr: fe80::42:14ff:fecb:41b6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:224 (224.0 B)  TX bytes:1068 (1.0 KiB)

eth0      Link encap:Ethernet  HWaddr 02:50:00:00:00:01
          inet addr:192.168.65.3  Bcast:192.168.65.255  Mask:255.255.255.0
          inet6 addr: fe80::50:ff:fe00:1/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:37 errors:0 dropped:0 overruns:0 frame:0
          TX packets:47 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3510 (3.4 KiB)  TX bytes:3556 (3.4 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:140 (140.0 B)  TX bytes:140 (140.0 B)

veth4ce4999 Link encap:Ethernet  HWaddr 56:5B:F4:6B:D1:20
          inet6 addr: fe80::545b:f4ff:fe6b:d120/64 Scope:Link
          inet6 addr: fe80::545b:f4ff:fe6b:d120/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:1618 (1.5 KiB)
...

As can be seen, eth0 has an address from the hosts address range. This means that the JGroups configuration doesn’t require an external_addr attribute and can simply define bind_addr:

<TCP
     bind_addr="match-interface:eth0"
  ...
 />

This will bind the transport’s sockets to 172.31.14.130.

Bridged versus host networking

It is simpler to use host networks as they don’t require NATing between addresses (no external_addr). However, in some cases, this may not be available, when automatically launching containers via AWS Beanstalk.

Running containers locally

The JGroups demos can be run as multiple containers forming a cluster on the same local box (in bridged or host network mode), or across boxes in the local network (in host network).

To run containers locally, the configuration used by JGroups uses IP multicasting, as shown in ./conf/udp.xml (abridged):

<config>
    <UDP
       bind_addr="match-interface:eth0,match-interface:en0,site_local,loopback"
    />

    <PING />
    <MERGE3 max_interval="30000" min_interval="10000"/>
    <FD_SOCK/>
    <FD_ALL timeout="10000" interval="3000"/>
    <pbcast.NAKACK2/>
    <UNICAST3 />
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                   max_bytes="8m"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"/>
    <UFC max_credits="2M" min_threshold="0.4"/>
    <MFC max_credits="2M" min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
</config>

This configuration uses IP multicasting (UDP as transport) and multicast discovery (PING as discovery protocol). It binds to eth0 if found, if not tries to bind to en0 (Macs), then tries to find a site local IP address, and falls back to lookback (127.0.0.1) if all preceding addresses didn’t match.

Run multiple container as follows:

docker run -it --rm --network=host belaban/jgroups

In the container, there’s a readme which describes how to run the demos, e.g. Chat. Also see Demos below for details on the demos. To run the Chat demo, run

# udp.xml is the default so -props can be omitted
chat.sh -props udp.xml -name A

This should create a cluster of Chat nodes, even across hosts due to --network=host.

If IP multicasting is not supported, one can always fall back to TCP, but this means copying tcp.xml (for example) from the JGroups distribution into the Docker container and modifying it (e.g. change the discovery protocol). The JGroups manual at http://www.jgroups.org/manual4/index.html shows how to do this.

Amazon Web Services (AWS)

This section shows how to run Docker containers with the JGroups demo on AWS. Contrary to running locally, we have to use a security policy to define which traffic (TCP/UDP/ICMP) to send/receive on which ports.

Note
Alternative ways of running JGroups in AWS include (1) creating/customizing an EC2 image and then running a host with the image directly (without docker), or (2) using EC2 Container Service (ECS), which runs docker images on a number of EC2 instances.
The reason (1) is not discussed here is lazyness :-) I did not want to go through the rather long turnaround times of iteratively creating and customizing an AMI. Instead, creating and customizing a docker image and running/testing it locally made for much faster turnaround times!
Running on ECS (option 2) is similar to what’s described here, except that there’s no need to spin up EC2 instances manually, as this is done by ECS when defining a cluster.

Since IP multicasting is not supported, we also have to switch to TCP as transport and also change the discovery protocol (see Discovery below).

Finally, if we use S3 for discovery, this requires an IAM role assigned to the container that permits creation, deletion, reading and writing of S3 buckets. These topics are discussed below.

The Amazon image used for running the demo contains a bare bones Linux plus the Docker software: ami-92659c84. Search for amazon-ecs-optimized to find it. Of course, any other AMI that contains Docker can be used.

First, an EC2 instance is created with this AMI, then we need to ssh into it and start the Docker container.

Security policy

The security policy used for the demo permits all traffic from any protocol on any port and from/to any address. This is fine for a demo, but of course not recommended for production.

Alternatively, using no security group is the same as above.

The selected security policy needs to be associated with the EC2 instance when it is started.

IAM security role to access S3

If S3 (e.g. S3_PING or NATIVE_S3_PING) is used for discovery in the JGroups configuration, then the EC2 instance needs to be started with a role that permits access to S3 buckets, specifically creation of buckets, reading objects from buckets and writing objects to buckets (deletion is currently not used).

The role that I used for the demo was AmazonS3FullAccess which (as its name suggests) has all permissions regarding S3 buckets.

Discovery

The task of discovery is for a new cluster node to find the coordinator of the cluster to join, and send it a join request. Whereas PING sends a multicast and everyone responds with the coordinator’s address and information about themselves, this cannot be done in a cloud where IP multicasting usually isn’t supported.

Therefore, we have to resort to other ways of running discovery. They’re discussed in the following sections.

NATIVE_S3_PING

NATIVE_S3_PING is a separate protocol developed at https://github.com/jgroups-extras/native-s3-ping. It uses the Amazon S3 Java SDK to access buckets in S3 which store information about cluster members.

The advantage over S3_PING is that no credentials (AWS_ACCESS_KEY, AWS_SECRET_ACCESS_KEY) needs to be passed to the EC2 instance on startup, but rather the credentials of the user which started the EC2 instance are used.

There’s a configuration ./conf/aws.xml which includes this protocol:

<config>
    <TCP
         external_addr="${JGROUPS_EXTERNAL_ADDR:match-interface:eth0}"
	     bind_addr="site_local,match-interface:eth0"
         bind_port="${TCP_PORT:7800}"
    />
    <!--
      Uses an S3 bucket to discover members in the cluster.
      - If "mybucket" doesn't exist, it will be created (requires permissions)
    -->
    <org.jgroups.aws.s3.NATIVE_S3_PING
        region_name="${S3_REGION:us-east-1}"
        bucket_name="${S3_BUCKET:mybucket}"
     />
    <MERGE3 max_interval="30000" min_interval="10000"/>
    <FD_SOCK external_addr="${JGROUPS_EXTERNAL_ADDR}"
             start_port="${FD_SOCK_PORT:9000}"/>
    <FD_ALL timeout="10000" interval="3000"/>
    <pbcast.NAKACK2/>
    <UNICAST3/>
    <pbcast.STABLE desired_avg_gossip="50000"
                   max_bytes="8m"/>
    <pbcast.GMS print_local_addr="true" join_timeout="3000"
                view_bundling="true"/>
    <UFC max_credits="2M" min_threshold="0.4"/>
    <MFC max_credits="2M" min_threshold="0.4"/>
    <FRAG2 frag_size="60K"  />
</config>

Attributes bind_addr and external_addr were discussed above. Note that the latter is not required if the Docker container is started with --network=host.

The attributes used by NATIVE_S3_PING are region_name and bucket_name. The latter defines the bucket that will be used to store information about the members of this cluster. All objects created in this bucket are prefixed by the cluster name ("draw" in the example), e.g. mybucket/draw-126532-A.list. The former defines the region in which the bucket is located.

Note
If a bucket doesn’t exist, a new one will be created. Since bucket names have a global name space, a bucket that already exists for a different user will throw an exception. It is therefore recommended to create a bucket up-front and use it as bucket_name.

Both region and bucket can be overridden by passing system properties S3_REGION and S3_BUCKET_NAME to the JGroups demo. Similar to passing ext-addr (see above), the docker container could be started with two environment variables for region and bucket name and they could then be read from the environment by a script that passes them as env vars to the demo.

S3_PING

S3_PING is discussed in the JGroups manual at http://www.jgroups.org/manual4/index.html#_s3_ping.

TCPGOSSIP and GossipRouter

TCPGOSSIP is a discovery protocol which stores information about cluster nodes in one or more GossipRouters, which are separate processes acting as lookup services. The configuration of TCPGOSSIP needs to include the addresses of the GossipRouters, e.g.:

<TCP .../>
<TCPGOSSIP
    initial_hosts="GR1[12001],GR2[12001]"
 />

This means that there are GossipRouters running on GR1 and GR2 at port 12001, and TCPGOSSIP will register each member with both processes.

The GossipRouter processes can be started inside Docker container, too, e.g. using the image available at https://hub.docker.com/r/jboss/jgroups-gossip.

TCPPING

TCPPING lists all cluster nodes in a static list, e.g.

<TCP bind_port="7800" .../>
<TCPPING
    initial_hosts="HostA[7800],HostB[7800],HostC[7800],..."
 />

The problem here is that the IP addresses of HostA, HostB, etc have to be known before starting any of the EC2 instances. This requires either fixed addresses, or addresses mapped to a DNS service, e.g. Amazon Route 53.

An alternative is to use AWS Elastic IP addresses, where the address of each cluster node is fixed and all of the potential members of the cluster are known ahead of starting the cluster.

Another alternative is to create a Virtual Private Network (VPC), e.g. 172.45.0.0 and assign addresses from block 172.45.0.1 - 172.45.0.100. This means that TCPPING.initial_hosts needs to have all 100 members of this block listed.

Elastic File System (EFS) and FILE_PING

Another way to run discovery on AWS is Elastic File System (EFS). EFS is a distributed file system, accessible by all cluster nodes. Any creation of modification of a file by a cluster member is visible by every other member.

FILE_PING can therefore be used for discovery; the location attribute needs to point to a directory mounted by EFS. Reads and writes are performed by EFS.

JDBC_PING / RDS

JDBC_PING uses a table in a database for discovery. In conjunction with RDS, JDBC_PING can be setup to store information in an RDS table.

Running the Docker containers on EC2

After spinning up an EC2 instance, we’re now ready to run the Docker container with image belaban/jgroups.

The JGroups configuration in the Docker image is ./conf/aws.xml, listed in NATIVE_S3_PING. It requires ports 7800 (used by TCP) and 9000 (used by FD_SOCK) to be published; therefore the command to start the container is:

docker run -it --rm --network=host -p 7800:7800 -p 9000:9000 belaban/jgroups
  • -it: starts an interactive shell (TTY) so we can interact with (e.g.) the Chat demo

  • --rm: removes the container when done

  • --network=host: picks a host network. If omitted, a bridge network would be picked by default

  • -p 7800:7800 -p 9000:9000: publishes ports 7800 to 7800 and 9000 to 9000.

  • belaban/jgroups: the Docker images hosted on dockerhub.com

Note
When using a host network, publishing the ports is not necessary, therefore the -p 7800:7800 -p 9000:9000 option can be omitted.

When the Docker container has been started, the entrypoint (/bin/bash) shows a readme which explains how to start the demos. To for example run Chat, the following command has to be executed:

chat.sh -props aws.xml -name A -b bucket-name
  • -props aws.xml: this instructs Chat to use aws.xml (described above) as configuration

  • -name A: gives each cluster node a unique name. If omitted, a random name will be picked

  • -b bucket-name: the name of the S3 bucket. Will be created if it doesn’t exist. Note that an exception will be thrown if that name is already in use by a different project.

Google Compute Platform (GCP)

There is a separate project jgroups-google which uses Google Cloud Storage for discovery, and provides GOOGLE_PING2 as discovery protocol.

It is meant to be used with Google Compute Engine nodes that are started manually. To run JGroups in Google Container Engine (GKE), which uses Kubernetes under the covers, we recommend to use KUBE_PING instead.

Refer to http://belaban.blogspot.ch/2017/05/running-infinispan-cluster-with.html for step-by-step instructions on how to run a JGroups cluster on GKE.

Demos

To run the image directly, execute

docker run -it --rm --network=host belaban/jgroups

or

docker run -p 7800:7800 -p 9000:9000 -it --rm belaban/jgroups

To build the image, run

docker build .

The demos are described below. The idea is to run the demo apps in a container each on the same host and they will form a cluster.

Chat

To run it:

chat [-props config] [-name name] [-b bucketname], e.g. chat -props ./udp.xml -name A

Run the Chat application in multiple containers on the same host and they will form a cluster. Typing a message into one Chat will send it to all other chats

Distributed locks

Distributed locks are implementations of java.util.concurrent.locks.Lock and provide locks that can be accessed from all nodes in a cluster.

A typical use case is to lock a resource so that only 1 thread in a given node in the cluster can access it. Should a node crash while holding a lock, the lock is released immediately.

For more details, see the section on distributed locks at http://www.jgroups.org/manual/index.html#LockService.

To run the lock demo, type:

lock [-name  name]

Typing help into the shell shows a few commands:

[jgroups@b21d0fa6c79d ~]$ lock -name A
-------------------------------------------------------------------
GMS: address=A, cluster=lock-cluster, physical address=172.17.0.178:52519
-------------------------------------------------------------------
: help
LockServiceDemo [-props properties] [-name name]
Valid commands:
    lock (<lock name>)+
    unlock (<lock name> | "ALL")+
    trylock (<lock name>)+ [<timeout>]
Example:
    lock lock lock2 lock3
    unlock all
    trylock bela michelle 300
:

If you start instances A and B, you can try out the following:

  1. A: lock printer

  2. B: lock printer // will block

  3. A: unlock printer // now B will get the lock

Or a lock holder can be killed:

  1. A: lock printer

  2. B: lock printer

  3. Kill A. B will now get the lock on "printer"

Distributed counters

Distributed counters are counters will can be atomically incremented, decremented, compare-and-set etc across a cluster.

To run the demo:

count [-name name]

Run multiple instances in different containers. The demo uses a counter named "mycounter" and there’s a command prompt which shows the commands to be executed.

Questions can be asked on the users or dev mailing lists: https://sourceforge.net/p/javagroups/mailman.

Enjoy !

Bela Ban