
Python | Deep Reinforcement Learning with OpenAI Gym & TensorFlow

Updated: Oct 30

Learn how robots learn! Deep reinforcement learning is a subfield of machine learning that focuses on how agents learn to make decisions through trial and error and reward-based feedback.

Before diving into this topic, we'll first need to define an important term, just in case we're covering content you're unfamiliar with:

Software Agents: These are programs that perform actions to achieve a particular goal. All agents are programs, but not all programs are agents.

Python programmer using a software agent

Commonly used agents include Internet search systems, e-mail inboxes, shopping bots, form auto-fillers, and chatbots. In a game, an agent might use search algorithms to explore different moves and evaluate their outcomes. In a robotic application, an agent might use sensors and machine learning to figure out how to navigate the environment and perform tasks. Key qualities of agents are that they are:

  • Reactive to their Environment

  • Autonomous

  • Goal-Oriented

  • Persistent

An agent can be implemented using various approaches. These include rule-based systems, search algorithms, and the kind we'll be looking at today: machine learning techniques like reinforcement learning.

With this technique, an agent is trained to perform a task by interacting with its environment and receiving feedback in the form of rewards. The agent then uses this feedback to learn a policy that maximizes its rewards over time. It's like the saying:

Practice Makes Perfect

The agent not only learns how to perform a task, but how to improve its performance for the best result. It practices different methods while searching for the best solution.
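
To make "practice" concrete, here is a minimal sketch of trial-and-error learning on a toy problem. It is not taken from OpenAI Gym or TensorFlow: the function name `run_bandit`, the three made-up reward averages, and the epsilon-greedy rule (mostly pick the best-looking action, sometimes try a random one) are all assumptions chosen for illustration.

```python
import random

def run_bandit(true_means, steps=3000, epsilon=0.1, noise=0.5, seed=0):
    """Toy trial-and-error learner (hypothetical example, not a Gym environment)."""
    rng = random.Random(seed)
    n_actions = len(true_means)
    estimates = [0.0] * n_actions  # the agent's running guess of each action's reward
    counts = [0] * n_actions       # how many times each action has been tried

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action to see what happens.
            action = rng.randrange(n_actions)
        else:
            # Exploit: repeat the action that currently looks best.
            action = max(range(n_actions), key=lambda a: estimates[a])

        # The environment answers with a noisy reward.
        reward = rng.gauss(true_means[action], noise)

        # Update the running average for the action that was tried.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates, counts

# Made-up average rewards: the third action is the best one.
estimates, counts = run_bandit([0.2, 0.5, 1.0])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("Estimated rewards:", [round(e, 2) for e in estimates])
print("Action the agent now prefers:", best_action)
```

The agent is never told which action is best. It only sees rewards, and its estimates improve the more it practices.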

A great visual example of this is learning how to walk. A teacher can only warn the student about so many variables in the terrain; the student must learn from its trips, fumbles, and falls to walk better next time.

Meet Cassie: a bipedal robot at Berkeley who taught herself to walk using reinforcement learning. In the video, Hybrid Robotics provides a useful chart for understanding the reinforcement learning cycle.

Deep reinforcement learning has shown great promise in solving complex problems in areas such as robotics, game playing, and autonomous driving.

How does a Python programmer write code that teaches an agent to do something new?

Two of the most popular Python libraries for implementing this technique are OpenAI Gym and TensorFlow. OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms, while TensorFlow is an open-source software library for machine learning and AI. Let's take a closer look at what that means and how these libraries can be used together to implement deep reinforcement learning:

OpenAI Gym: The Practice Space

OpenAI Gym provides a wide range of environments for testing reinforcement learning algorithms. These environments simulate a variety of tasks, such as controlling a robot arm, playing a game of Atari, or navigating a maze. Each environment provides:

  • An observation of the current state

  • A set of possible actions

  • A reward for each action

  • A way to transition to the next state

Again, the agent learns to perform the task by interacting with the environment and receiving feedback in the form of rewards.
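
The four pieces listed above can be sketched without installing anything. The `LineWorld` class below is a hypothetical toy environment, not part of OpenAI Gym; it only imitates the classic `reset()` / `step()` shape that Gym environments expose, and its reward values are made up for illustration.

```python
class LineWorld:
    """Hypothetical toy environment: walk from position 0 to the goal at position 4."""

    actions = (0, 1)  # the set of possible actions: 0 = step left, 1 = step right

    def __init__(self, size=5):
        self.size = size
        self.position = 0

    def reset(self):
        # Return the observation of the initial state.
        self.position = 0
        return self.position

    def step(self, action):
        # Transition to the next state, staying inside the line.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.size - 1)

        # Reward: a small penalty per step, a bonus for reaching the goal.
        done = self.position == self.size - 1
        reward = 1.0 if done else -0.1
        return self.position, reward, done, {}

toy_env = LineWorld()
obs = toy_env.reset()
total_reward = 0.0
done = False
while not done:
    obs, reward, done, info = toy_env.step(1)  # always step right
    total_reward += reward

print("Final position:", obs, "Total reward:", round(total_reward, 2))
```

Real Gym environments are far richer, but the conversation between agent and environment follows this same pattern: observe, act, collect a reward, move to the next state.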

One environment is the CartPole environment, which consists of a pole attached to a cart that moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pole starts upright, and the goal is to prevent it from falling over. Here is an example of how to use OpenAI Gym to simulate a CartPole environment:

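import gym

# Assumes the classic Gym API (gym < 0.26): reset() returns only the
# observation and step() returns four values.
env = gym.make('CartPole-v0')
obs = env.reset()

for t in range(100):
    action = env.action_space.sample()  # choose a random action
    obs, reward, done, info = env.step(action)
    env.render()  # draw the result of the action

    if done:
        # The episode is over, so stop stepping the environment.
        print(f"Episode finished after {t + 1} time steps")
        break

env.close()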

In this code snippet, we create an instance of the CartPole-v0 environment and reset it to the initial state. We then run a loop for up to 100 time steps, during which we choose a random action from the set of possible actions, perform the action in the environment, and render the result. We continue until the episode ends (done=True), which for CartPole means the pole has tipped too far or the cart has left the track, or until the loop terminates.

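Because the actions here are random, the pole usually falls well before the loop finishes. The challenge presented to the software agent is to choose better actions, so that the cart pole stays balanced instead of tipping towards the ground.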
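
A first step beyond random actions is a hand-written rule. The sketch below is not learning yet, and `lean_policy` is a hypothetical helper name. It assumes the CartPole observation is laid out as cart position, cart velocity, pole angle, pole angular velocity, that a positive angle means the pole leans right, and that action 1 pushes the cart right while action 0 pushes it left.

```python
def lean_policy(obs):
    """Hypothetical hand-written rule: push the cart toward the side the pole leans."""
    # Assumed layout: [cart position, cart velocity, pole angle, pole angular velocity]
    pole_angle = obs[2]
    # Assumed encoding: 1 pushes the cart right, 0 pushes it left.
    return 1 if pole_angle > 0 else 0

# Made-up observations for illustration, not real simulator output.
leaning_right = [0.0, 0.0, 0.05, 0.0]
leaning_left = [0.0, 0.0, -0.05, 0.0]

print("Pole leans right, action:", lean_policy(leaning_right))
print("Pole leans left, action:", lean_policy(leaning_left))
```

In the Gym loop, a rule like this would stand in for `env.action_space.sample()`. Reinforcement learning aims to discover a better rule than this automatically, from rewards alone.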

TL;DR: OpenAI Gym provides a digital practice space (like a gym) to place a software agent and teach it to teach itself. Now, all we need is the class curriculum.

TensorFlow: The Lesson

TensorFlow is a powerful library for building and training deep neural networks. In the context of reinforcement learning, TensorFlow can be used to implement the agent's policy, which maps observations to actions. This policy can be represented by a neural network that takes the current state as input and outputs a probability distribution over the possible actions.

Here is an example of how to use TensorFlow to build a simple neural network for the CartPole environment:

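import gym
import numpy as np
import tensorflow as tf

# Assumes the classic Gym API (gym < 0.26), where reset() returns only the observation.
env = gym.make('CartPole-v0')
obs_dim = env.observation_space.shape[0]  # size of one observation
n_actions = env.action_space.n            # number of possible actions

# Policy network: an observation goes in, action probabilities come out.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(obs_dim,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(n_actions, activation='softmax')
])

obs = env.reset()
obs = np.reshape(obs, (1, obs_dim))  # add a batch dimension
action_probs = model.predict(obs)
# Sample one action according to the predicted probabilities.
action = np.random.choice(np.arange(n_actions), p=action_probs[0])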

In this code snippet, we read the observation size and the number of actions from the CartPole-v0 environment, then define a neural network with two hidden layers of size 32 and an output layer with softmax activation that outputs a probability distribution over the possible actions. We reset the environment to obtain an observation of the current state, reshape it to match the input shape of the neural network, and use the network to obtain a probability distribution over the actions. We then choose an action randomly from this distribution. The network is still untrained at this point, so its probabilities do not yet reflect good moves.
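
The softmax and sampling steps can be unpacked with the standard library alone. This is a minimal sketch under stated assumptions: the raw scores are made-up numbers standing in for a network's output, and `softmax` and `sample_action` are hypothetical helper names, not TensorFlow functions.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Pick an action index at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Made-up scores for the two CartPole actions (illustration only).
probs = softmax([2.0, 0.0])
rng = random.Random(0)
actions = [sample_action(probs, rng) for _ in range(1000)]

print("Probabilities:", [round(p, 2) for p in probs])
print("Times each action was chosen:", actions.count(0), actions.count(1))
```

Sampling instead of always taking the most likely action keeps the agent trying alternatives, which is how it finds out whether a less favored move is actually better.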

Combining OpenAI Gym and TensorFlow

To combine OpenAI Gym and TensorFlow, we can use TensorFlow to implement the agent's policy and OpenAI Gym to provide the environment. To picture it: we create a practice room, place our student in it, and tell it what the objective is.

In the CartPole environment, we're challenging the software agent to balance the pole by pushing the cart back and forth along the track. The software agent tries again and again, learning what happens when it makes a wrong move, until it learns the best way to keep the pole balanced. Success!
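
This post does not name a specific training algorithm, so the sketch below shows only one common ingredient, not the method used here. Many policy-learning approaches score each move by the discounted return: the reward it earned plus a shrinking share of every reward that followed. The helper name `discounted_returns` and the discount factor `gamma` are assumptions for illustration.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return, for each time step, the discounted sum of rewards from that step onward."""
    returns = []
    running = 0.0
    # Walk backwards so each step can reuse the total computed for the step after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

# A short episode with a reward of 1 per step, as a balancing task might give.
episode_rewards = [1.0, 1.0, 1.0]
print(discounted_returns(episode_rewards, gamma=0.5))
```

Moves made early in a long, successful episode receive larger returns than moves made just before the pole falls, which gives the learner a signal about which decisions were wrong.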

Software Agents Learning to Balance

Looking to add a Python Programmer to your team?

If this doesn't sound like your field of expertise, hire someone who can help! Software consultancies like BearPeak Technology Group have expert developers for hire who can do all of these tasks for you. Check us out! We're a Boulder, Colorado-based team of engineers who help you hire remote software developers efficiently and reliably. We offer free consultations and are dedicated to your startup's success.


It's important for us to disclose the multiple authors of this blog post: The original outline was written by chat.openai, an exciting new AI language model. The content was then edited and revised by Lindey Hoak.

"OpenAI (2023). ChatGPT. Retrieved from"
