Added multi cartpole example #48
Status: Open

veds12 wants to merge 20 commits into aidudezzz:dev from eellak:dev.
Commits (20, all by veds12):

- a33519c added multi-cartpole grid env
- 7853e1a training successful
- c257e55 cleaned controllers
- fb20c28 moved multi-cartpole to examples
- 4203cbf deepbots update
- 893c5a1 update comm scheme
- 4155e44 updateed multi-cartpole
- cbf2f95 dummy commit
- 70748f2 fixed multicartpole
- e11ac61 Merge pull request #1 from veds12/dev
- 776b8be Merge branch 'dev' of https://github.com/aidudezzz/deepworlds into dev
- 4cf7e4e Merge branch 'dev' of https://github.com/eellak/gsoc2021-deepbots-evo…
- 0ba9f62 added training gif
- 6371bfc updated multi-cartpole
- 192a097 Merge branch 'dev' of https://github.com/aidudezzz/deepworlds into dev
- 72f9432 added a group node for boundary boxes
- 4eebdd0 added plotting at the end of training
- cac8d71 added readme & removed trash
- 1c69920 updated coordinate system in the controller
- 3e9453f Merge branch 'dev' of https://github.com/aidudezzz/deepworlds into dev
README (new file, 27 additions)
# Multi Cartpole

This example is a modified version of the classic [Cartpole](https://www.gymlibrary.ml/environments/classic_control/cart_pole/) Reinforcement Learning problem, implemented in [Webots](https://cyberbotics.com/). It builds upon the existing single-cartpole example, extending it to several robots trained in parallel.

## Environment Details

* The environment consists of 9 cartpole robots placed in a 3D grid world.
* Each robot works independently of the others.
* A reward of +1 is received for each time step in which **all** carts manage to keep their poles upright.
* An episode ends when any one of the carts loses the balance of its pole.
* The environment is solved when the average score of the 100 most recent episodes exceeds 195.0 (see the sketch after this list).
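The supervisor's solve check is not shown in this excerpt; the following is only a minimal sketch of the criterion above, assuming the scores live in a plain list (the name `episode_scores` is illustrative):

```python
# Sketch of the solve criterion described above; `episode_scores` stands in
# for the supervisor's list of per-episode scores.
def solved(episode_scores):
    if len(episode_scores) >= 100:  # need at least 100 finished episodes
        return sum(episode_scores[-100:]) / 100.0 > 195.0
    return False
```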
## Implementation Details

* This example demonstrates how the [Emitter-Receiver scheme](https://github.com/aidudezzz/deepbots#emitter---receiver-scheme) can be used to communicate between a central controller and multiple robots; a sketch of the supervisor side follows this list.
* The included solution uses a PPO policy with both the actor and the critic shared across all the agents.
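The supervisor controller itself is not part of this excerpt, so the following is only a sketch of the supervisor side of the scheme, assuming a single emitter/receiver pair on the supervisor and the standard Webots device API; the device names and per-robot channel layout mirror `robot_controller.py` below but are otherwise assumptions:

```python
# Hypothetical supervisor-side sketch of the emitter-receiver scheme.
from controller import Supervisor

NUM_ROBOTS = 9

supervisor = Supervisor()
timestep = int(supervisor.getBasicTimeStep())

emitter = supervisor.getDevice("emitter")      # device names are assumptions
receiver = supervisor.getDevice("receiver")
receiver.enable(timestep)
receiver.setChannel(-1)                        # channel -1 listens on all channels

def send_action(robot_index, action):
    # Address one robot by switching the emitter to its channel, then send
    # the action as a comma-separated (here: single-value) string.
    emitter.setChannel(robot_index)
    emitter.send(str(action).encode("utf-8"))

def poll_observations():
    # Drain the queue; each packet is one robot's CSV-encoded observation.
    observations = []
    while receiver.getQueueLength() > 0:
        message = receiver.getData().decode("utf-8").split(",")
        observations.append([float(x) for x in message])
        receiver.nextPacket()
    return observations
```

In practice deepbots wraps this pattern in its supervisor classes; the sketch only illustrates the per-robot channel idea.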
## Contents

* [Controllers](./controllers/)
* [Worlds](./worlds/)
## Graphics

Training using PPO (shared actor and shared critic):

![Training](./doc/gif/trained.gif)
examples/multi_cartpole/controllers/robot_controller/robot_controller.py (new file, 124 additions)
```python
from deepbots.robots.controllers.robot_emitter_receiver_csv import RobotEmitterReceiverCSV


class CartPoleRobot(RobotEmitterReceiverCSV):
    """
    The CartPole robot has 4 wheels and a pole connected by an unactuated hinge to its body.
    The hinge contains a Position Sensor device to measure the angle from vertical needed in the observation.
    Hinge: https://cyberbotics.com/doc/reference/hingejoint
    Position Sensor: https://cyberbotics.com/doc/reference/positionsensor
    """

    def __init__(self):
        """
        The constructor gets the Position Sensor reference and enables it, and also initializes the wheels.
        """
        super().__init__()

        self.robot_num = int(self.robot.getName()[-1])
        self.timestep = int(self.robot.getBasicTimeStep())

        self.emitter, self.receiver = self.initialize_comms()
        self.emitter.setChannel(self.robot_num)
        self.receiver.setChannel(self.robot_num)

        self.positionSensor = self.robot.getDevice("polePosSensor")
        self.positionSensor.enable(self.timestep)

        self.wheels = [None for _ in range(4)]
        self.setup_motors()

    def initialize_comms(self, emitter_name="emitter", receiver_name="receiver"):
        emitter = self.robot.getDevice(emitter_name)
        receiver = self.robot.getDevice(receiver_name)
        receiver.enable(self.timestep)

        return emitter, receiver

    def setup_motors(self):
        """
        This method initializes the four wheels, storing the references inside a list and setting the starting
        positions and velocities.
        """
        self.wheels[0] = self.robot.getDevice('wheel1')
        self.wheels[1] = self.robot.getDevice('wheel2')
        self.wheels[2] = self.robot.getDevice('wheel3')
        self.wheels[3] = self.robot.getDevice('wheel4')
        for i in range(len(self.wheels)):
            self.wheels[i].setPosition(float('inf'))
            self.wheels[i].setVelocity(0.0)

    def create_message(self):
        """
        This method packs the robot's observation into a list of strings to be sent to the supervisor.
        The message contains only the Position Sensor value, i.e. the angle from the vertical position in radians.
        From the Webots documentation:
        'The getValue function returns the most recent value measured by the specified position sensor. Depending on
        the type, it will return a value in radians (angular position sensor) or in meters (linear position sensor).'

        :return: A list of strings with the robot's observations.
        :rtype: list
        """
        message = [self.positionSensor.getValue()]
        return message

    def handle_receiver(self):
        """
        Modified handle_receiver from the basic implementation of deepbots.
        This one consumes all available messages in the queue during the step it is called.
        """
        while self.receiver.getQueueLength() > 0:
            # Receive and decode message from supervisor
            message = self.receiver.getData().decode("utf-8")
            # Convert string message into a list
            message = message.split(",")

            self.use_message_data(message)

            self.receiver.nextPacket()

    def use_message_data(self, message):
        """
        This method unpacks the supervisor's message, which contains the next action to be executed by the robot.
        In this case it contains an integer denoting the action, either 0 or 1, with 0 being forward and
        1 being backward movement. The corresponding motorSpeed value is applied to the wheels.

        :param message: The message the supervisor sent containing the next action.
        :type message: list of strings
        """
        action = int(message[0])

        assert action == 0 or action == 1, "CartPoleRobot controller got incorrect action value: " + str(action)

        if action == 0:
            motorSpeed = 5.0
        else:
            motorSpeed = -5.0

        for i in range(len(self.wheels)):
            self.wheels[i].setPosition(float('inf'))
            self.wheels[i].setVelocity(motorSpeed)

    def handle_emitter(self):
        data = self.create_message()
        string_message = ""

        if type(data) is list:
            string_message = ",".join(map(str, data))
        elif type(data) is str:
            string_message = data
        else:
            raise TypeError(
                "message must be either a comma-separated string or a 1D list")

        string_message = string_message.encode("utf-8")
        self.emitter.send(string_message)

    def run(self):
        # step() returns -1 when Webots is about to terminate the controller
        while self.robot.step(self.timestep) != -1:
            self.handle_receiver()
            self.handle_emitter()


# Create the robot controller object and run it
robot_controller = CartPoleRobot()
robot_controller.run()
```
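For reference, the CSV protocol implemented by `handle_emitter` and `handle_receiver` above is a plain string round trip; a toy example with illustrative values and no Webots dependency:

```python
# Toy round trip of the CSV message protocol (values are illustrative).
observation = [0.0314]                                    # pole angle in radians
packed = ",".join(map(str, observation)).encode("utf-8")  # what handle_emitter sends
unpacked = [float(x) for x in packed.decode("utf-8").split(",")]
assert unpacked == observation
```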
examples/multi_cartpole/controllers/supervisor_manager/PPO_runner.py (new file, 93 additions)
```python
from numpy import convolve, ones, mean

from supervisor_controller import CartPoleSupervisor
from agent.PPO_agent import PPOAgent, Transition
from utilities import plotData

# Change these variables if needed
EPISODE_LIMIT = 2000
STEPS_PER_EPISODE = 200
NUM_ROBOTS = 9


def run():
    # Initialize supervisor object
    supervisorEnv = CartPoleSupervisor()

    episodeCount = 0

    # The agent used here is trained with the PPO algorithm (https://arxiv.org/abs/1707.06347).
    agent = PPOAgent(supervisorEnv.observationSpace, supervisorEnv.actionSpace)

    solved = False  # Whether the solved requirement is met

    # Run outer loop until the episode limit is reached or the task is solved
    while not solved and episodeCount < EPISODE_LIMIT:
        state = supervisorEnv.reset()  # Reset robots and get starting observation
        supervisorEnv.episodeScore = 0
        actionProbs = [[] for _ in range(NUM_ROBOTS)]

        # Inner loop is the episode loop
        for step in range(STEPS_PER_EPISODE):
            # In training mode the agent samples from the probability distribution, naturally implementing exploration
            selectedActions = []
            for i in range(NUM_ROBOTS):
                selectedAction, actionProb = agent.work(state[i], type_="selectAction")
                actionProbs[i].append(actionProb)
                selectedActions.append(selectedAction)

            # Step the supervisor to get the current selectedAction reward, the new state and whether we reached the
            # done condition
            newState, reward, done, info = supervisorEnv.step(selectedActions)

            # Save the current state transitions from all robots in the agent's memory
            for i in range(NUM_ROBOTS):
                trans = Transition(state[i], selectedActions[i], actionProbs[i][-1], reward, newState[i])
                agent.storeTransition(trans)

            supervisorEnv.episodeScore += reward  # Accumulate episode reward
            if done:
                # Save the episode's score
                supervisorEnv.episodeScoreList.append(supervisorEnv.episodeScore)
                agent.trainStep(batchSize=step + 1)
                solved = supervisorEnv.solved()  # Check whether the task is solved
                break

            state = newState  # state for the next step is the current step's newState

        avgActionProb = [round(mean(actionProbs[i]), 4) for i in range(NUM_ROBOTS)]

        # The average action probability tells us how confident the agent was in its actions.
        # By looking at this we can check whether the agent is converging to a certain policy.
        print(f"Episode: {episodeCount} Score = {supervisorEnv.episodeScore} | Average Action Probabilities = {avgActionProb}")

        episodeCount += 1  # Increment episode counter

    movingAvgN = 10
    plotData(convolve(supervisorEnv.episodeScoreList, ones((movingAvgN,)) / movingAvgN, mode='valid'),
             "episode", "episode score", "Episode scores over episodes")

    if not solved and not supervisorEnv.test:
        print("Reached episode limit and task was not solved.")
    else:
        if solved:
            print("Task is solved, deploying agent for testing...")
        else:
            print("Task is not solved, deploying agent for testing...")

    state = supervisorEnv.reset()
    supervisorEnv.test = True
    supervisorEnv.episodeScore = 0
    while True:
        selectedActions = []
        for i in range(NUM_ROBOTS):
            selectedAction, actionProb = agent.work(state[i], type_="selectAction")
            actionProbs[i].append(actionProb)
            selectedActions.append(selectedAction)

        state, reward, done, _ = supervisorEnv.step(selectedActions)
        supervisorEnv.episodeScore += reward  # Accumulate episode reward

        if done:
            print("Reward accumulated =", supervisorEnv.episodeScore)
            supervisorEnv.episodeScore = 0
            state = supervisorEnv.reset()
```
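The `PPOAgent` imported above (from `agent/PPO_agent.py`) is not part of this excerpt. As orientation only, "shared actor and critic" here means one pair of networks queried for every robot; the following is a minimal PyTorch sketch under that assumption, with illustrative layer sizes and names rather than the PR's actual implementation:

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Single policy network used for all nine carts (layer sizes illustrative)."""
    def __init__(self, observation_size, action_size, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(observation_size, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size),
        )

    def forward(self, state):
        # Returns a categorical distribution over the discrete actions (0/1).
        return torch.distributions.Categorical(logits=self.net(state))


class Critic(nn.Module):
    """Single value network used for all nine carts."""
    def __init__(self, observation_size, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(observation_size, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)


# Usage corresponding to the per-robot loop above: the same actor instance
# produces an action and its probability for each robot's observation.
# dist = actor(torch.as_tensor(state[i], dtype=torch.float32))
# action = dist.sample()
# prob = dist.probs[action].item()
```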