Friday Facts 16: UpDog (Unity+ML)

Unity has a Machine Learning component. Can I teach it to roll over and stand up?

My robots always start as a simulation.  Instead of hand-rolling a physics engine and a machine learning system I decided this time to try to teach a quadruped bot to a gait strategy (aka get up and walk) using Unity mlagents which use Tensorflow.  My scene and all related material is here:

What is Machine Learning?

Machine Learning is a magic box. On one side you feed it some inputs and then give/withold rewards until the other side produces the result you want. The first trick is to design the inputs well. The second is to design the reward well. There may be other tricks after that but I haven’t found them yet. Part of writing this is document what I’m trying. Maybe you have ideas about how to do it better? Talk with me.

The quadruped model

The root Environment lets me put everything in a Prefab which I can then instance to run many bots in parallel. Dog contains Torso which is the root of the skeleton. Each bone of the skeleton is an ArticulatedBody because they most closely resemble the behavior I’d expect to feed to my servos – set desired angle value, cross fingers. 

Each joint has one degree of rotational freedom and some reasonable limits.  I have no idea what joint stiffness would mirror real life.

Each joint contains one Model that has the mesh of the bot, the material, etc.

On the end of each leg is a foot which is ignored by physics. I paint them red when I detect contact with the terrain.

Behavior Parameters set the inputs and outputs of the network. Space Size is the number of inputs. Stacked Vectors stacks recent inputs so that the network has a kind of memory. Actions are the outputs, and Continuous Actions are decimal numbers usually in the range -1…1. Behavior Type is where you can set drive-by-ML or heuristic (aka drive it yourself).

Dog Controller is the script that is collects the Observations – it feeds the ML with sensory input (more on that later). Dog Controller also receives instructions from the network and then applies them – in this case, it sets the Drive Targets of each ArticulatedBody, same as I would do with a servo on a quadruped. Lastly, Dog Controller usually also gives a Reward to the network to teach it.

Decision Requester sets how often the network should make a decision. The value here is in Academy Steps, which isn’t Unity Update ticks, it isn’t frames, it isn’t FixedUpdate physics steps, it’s… something else.


At the start of each episode (learning cycle) I have to reset the position of the quadruped. I’ve found the easiest way is to make a preserve the original Torso and instead use a completely new clone of the entire Torso hierarchy.

Calculating Rewards

I have now experimented with many automatic styles of reward. Mostly my thinking is a variation on these themes:

  • The reward function first gives points for “uprightness” (read torso.transform.up.y approaching 1).
  • Add to that height off ground (max height 3.6, so height/3.6 will never be more than 1).
  • Add to that horizontal movement (maximum XZ speed of 10, then divided by 10, never more than 1).

These three elements are further scaled by a values that can be tweaked from Dog Controller.

I have also tried a manual voting style where the user (me) can click on bots that appear to be doing well to feed them a reward. I also added a button to end all episodes and start over, in case they’re all terrible. Machine Learning Eugenics :S

Results so far

nine bots running in parallel while my CPU gently weeps

So far it has managed to roll over a few times but never figured out how to stand up. It doesn’t even stay rolled over.

Things I know I don’t know


I have no idea if my network is shaped/configured well. Currently I’m using three layers with 128 neurons each.

I have no idea when rewards are applied to the network, so I’m not sure when is best to manually feed them. I don’t know if I should reward continuously while it is running or do as in the codemonkey example (reward and immediately reset).

ArticulatedBody is not great to work with.  GetJointPositions is in radians and SetJointTargets is in degrees? Physically I have no idea what stiffness to set for ArticulatedBody joints. Too high and they can launch the quad off the ground, too low and they don’t have the strength to roll the dog over.

Setting a good Joint Speed in the Dog Controller is also part of the problem. High stiffness and low joint speed (under 0.5 degree per step)

Why do I have to restart tensorboard results graphing script every time I restart the mlagent-learn training tool? my parameters haven’t changed, only the results that are being graphed.

Final thoughts

I’d really like to talk to some passionate and experienced people about training this successfully. You can find me on Discord or Github all the time. If you have specific comments or questions about this article, let me know and I’ll try to address them in an edit.

Robot Arm

Can Machine Learning Improve Robot Kinematics?

I’ve tried several times to hand-code inverse kinematics for robot industrial arms in Robot Overlord.  To make a long story short, there are a lot of complicated edge cases that break my brain.  Many modern methods involve a lot of guess work in the path planning.  I know that a well trained Machine Learning agent could do the job much better, but to date there are none I can download and install in my robot.  So I’m going to try and do it myself.  Join me!

The problem I’m trying to solve with Machine Learning

I have a 3D model of my arm in the Java app called Robot Overlord.  The 3D model is fully posable.  At any given pose I can calculate the angle of every joint and the exact location and orientation of the finger tip.  I have Forward Kinematics (FK) which is a tool to translate joint angles into finger tip.  I have Inverse Kinematics (IK) which is a tool to translate the reverse.

A robot arm is programmed with a series of poses.  Go to pose A, close the gripper, Go to pose B, insert the part, Go to pose C, etc… The robot software has to calculate the movement of the arm between poses and then adjust every motor simultaneously to drive the finger tip along the path between the two poses.  I’ve already solved the firmware part to drive six motors given sets of joint angles.

The problem is that one IK solution there might be many combinations of joint angles – sometimes infinite solutions.  To illustrate this, hold a finger on the table and move your elbow.  Your finger tip didn’t move and you had lots of possible wrist/shoulder changes.  As the arm moves through space it can cross a singularity – one of the spots with infinite solutions – and when it comes out the other side the hand-written solution flips the some or all of the arm 180 degrees around.  A smarter system would have recognized the problem and (for instance) kept the elbow to the side.  I have tried to write better IK code but have not had any success.

My Machine Learning Plan of Attack

My plan is to use a Deep Learning Neural Network.  The DNN is a bit of a black box: on one side there is a layer of Inputs, on the other there is a layer of Outputs, and in between there are one or more hidden layers.  Inputs filter through the layers and come out as Outputs.  The magic is in the filtration process!  The filter can be trained with gradient descent using a cost function – if I can score every input/output combination I can let the DNN play with the virtual arm while the cost function watches and says “good, bad, better, worse” until the two work out all the best possible movements.

My Machine Learning Network design

I believe my inputs should be:

  • Arm starting pose: 6 random angle values.  Because DNN inputs are values in the range 0…1 I’ll say 0 is 0 degrees and 1 is 360 degrees.
  • Arm ending pose: 3 random position values and 3 random angle values.  Position values are scaled over the total movement range of the robot.  So if the robot can move on the X axis from -50 to +50, (x/100+0.5) would give a value 0…1.
  • Interpolation between both poses: 1 decimal number.  0 means at the start pose and 1 means at the end pose.

I want my outputs to be:

  • Arm joint angles: 6 angle values.
  • confidence flag: 1 number.  0 means “I definitely can’t reach that pose” and 1 means “I can reach that pose”.

The cost function should work in two steps:

  1. make sure there is no error in the joint value – that is to say, the finger is actually on the path where it should be, and
  2. seek to reduce joint acceleration.  Adjust the elbow and the wrist ahead of time to avoid the need to suddenly twist.

I’m going to try first with two hidden layers, then experiment from there.  I intuitively guess it will take at least two layers because there are two parts to the cost function.

My Machine Learning code setup

Robot Overlord source code is already written in Java so I’ve added TensorFlow and DL4J.  Currently I’m still walking through the MNIST quickstart tutorials and asking the DL4J chat room for help.  They already solved a few head scratching differences between the DL4J quickstart tutorials and the DL4J up-to-date examples.  You can find my first test in Robot Overlord’s code at /src/main/java/com/marginallyclever/robotOverlord/


I hope that I’ve described my challenge thoroughly.  Please feel free to look at the code and make pull requests, or comment below or in the forums with any tips and advice you might have.  If you’re feeling helpful but not sure how, please share this far and wide so that I can reach the people who have the DNN know-how.

Stay awesome!