Let's go ahead and implement a random search algorithm with PyTorch:

- Import the Gym and PyTorch packages and create an environment instance:

>>> import gym

>>> import torch

>>> env = gym.make('CartPole-v0')

- Obtain the dimensions of the observation and action space:

>>> n_state = env.observation_space.shape[0]

>>> n_state

4

>>> n_action = env.action_space.n

>>> n_action

2

These will be used when we define the weight matrix, a tensor of size 4 x 2.

- Define a function that simulates an episode given the input weight and returns the total reward:

>>> def run_episode(env, weight):

...     state = env.reset()

...     total_reward = 0

...     is_done = False

...     while not is_done:

...         state = torch.from_numpy(state).float()

...         action = torch.argmax(torch.matmul(state, weight))

...         state, reward, is_done, _ = env.step(action.item())

...         total_reward += reward

...     return total_reward

Here, we convert the state array to a float tensor because we need to multiply it by the weight tensor, `torch.matmul(state, weight)`, for the linear mapping. The action with the higher value is selected using `torch.argmax()`. Don't forget to extract the value of the resulting action tensor using `.item()`, because it is a one-element tensor rather than a plain Python number.
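To make the linear mapping concrete, here is a minimal sketch of the action selection on its own. The helper name `select_action`, the hand-picked weight, and the sample states are assumptions chosen for illustration; they are not part of the recipe.

```python
import torch

def select_action(state, weight):
    """Return the index of the action with the highest linear score."""
    # Convert to a float32 tensor so the dtype matches the weight tensor.
    state = torch.as_tensor(state, dtype=torch.float32)
    # (n_state,) x (n_state, n_action) -> (n_action,): one score per action.
    scores = torch.matmul(state, weight)
    # argmax yields a one-element tensor; .item() turns it into a Python int.
    return torch.argmax(scores).item()

# Hypothetical 4 x 2 weight: action 0 scores state[0] + state[2],
# action 1 scores state[1] + state[3].
weight = torch.tensor([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 0.0],
                       [0.0, 1.0]])

print(select_action([0.02, -0.5, 0.03, 0.25], weight))  # scores (0.05, -0.25) -> 0
print(select_action([0.0, 0.5, 0.0, 0.25], weight))     # scores (0.0, 0.75) -> 1
```

With a randomly drawn weight, as in the recipe, the same two steps apply; only the scores change.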

- Specify the number of episodes:

>>> n_episode = 1000

- We need to keep track of the best total reward on the fly, as well as the corresponding weight. So, we specify their starting values:

>>> best_total_reward = 0

>>> best_weight = None

We will also record the total reward for every episode:

>>> total_rewards = []

- Now, we can run `n_episode` episodes. For each episode, we do the following:

- Randomly pick the weight
- Let the agent take actions according to the linear mapping
- Run the episode until it terminates and obtain the total reward
- Update the best total reward and the best weight if necessary
- Also, keep a record of the total reward

Put this into code as follows:

>>> for episode in range(n_episode):

...     weight = torch.rand(n_state, n_action)

...     total_reward = run_episode(env, weight)

...     print('Episode {}: {}'.format(episode+1, total_reward))

...     if total_reward > best_total_reward:

...         best_weight = weight

...         best_total_reward = total_reward

...     total_rewards.append(total_reward)

...

Episode 1: 10.0

Episode 2: 73.0

Episode 3: 86.0

Episode 4: 10.0

Episode 5: 11.0

……

……

Episode 996: 200.0

Episode 997: 11.0

Episode 998: 200.0

Episode 999: 200.0

Episode 1000: 9.0

We have obtained the best policy through 1,000 random searches. The best policy is parameterized by `best_weight`.
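The bookkeeping in this loop does not depend on CartPole itself. As a rough sketch, the same pattern can be written as a generic helper and tried on a toy objective. Everything named here (`random_search`, `sample_weight`, `toy_reward`) is hypothetical and stands in for `torch.rand` and `run_episode`; it only needs the standard library.

```python
import random

def random_search(evaluate, sample, n_trials):
    """Draw n_trials random candidates and keep the best-scoring one."""
    # Start from -inf so the helper also works when scores can be negative.
    best_score = float('-inf')
    best_candidate = None
    scores = []
    for _ in range(n_trials):
        candidate = sample()           # randomly pick a candidate
        score = evaluate(candidate)    # score it (one "episode")
        scores.append(score)           # keep a record of every score
        if score > best_score:         # update the best so far if necessary
            best_score = score
            best_candidate = candidate
    return best_candidate, best_score, scores

# Toy stand-ins, assumed for illustration only.
rng = random.Random(0)

def sample_weight():
    return [rng.random() for _ in range(4)]

def toy_reward(weight):
    # Highest (zero) when every entry equals 0.5.
    return -sum((w - 0.5) ** 2 for w in weight)

best_weight, best_reward, all_rewards = random_search(
    toy_reward, sample_weight, n_trials=1000)
print(best_weight, best_reward)
```

The recipe starts `best_total_reward` at 0, which is fine for CartPole because every step yields a reward of +1; the sketch starts from negative infinity only to stay general.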

- Before we test out the best policy in the testing episodes, we can calculate the average total reward achieved by random linear mapping:

>>> print('Average total reward over {} episode: {}'.format(

...     n_episode, sum(total_rewards) / n_episode))

Average total reward over 1000 episode: 47.197

This is more than twice (roughly 2.1 times) what we got from the random action policy (22.25).

- Now, let's see how the learned policy performs on 100 new episodes:

>>> n_episode_eval = 100

>>> total_rewards_eval = []

>>> for episode in range(n_episode_eval):

...     total_reward = run_episode(env, best_weight)

...     print('Episode {}: {}'.format(episode+1, total_reward))

...     total_rewards_eval.append(total_reward)

...

Episode 1: 200.0

Episode 2: 200.0

Episode 3: 200.0

Episode 4: 200.0

Episode 5: 200.0

……

……

Episode 96: 200.0

Episode 97: 188.0

Episode 98: 200.0

Episode 99: 200.0

Episode 100: 200.0

>>> print('Average total reward over {} episode: {}'.format(

...     n_episode_eval, sum(total_rewards_eval) / n_episode_eval))

Average total reward over 100 episode: 196.72

Surprisingly, with the learned policy, the average total reward over the testing episodes is close to the maximum of 200. Be aware that this value may vary a lot from run to run; it could be anywhere from 160 to 200.
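Because the result can vary this much, it can help to look at the spread of the evaluation rewards and not only their mean. The sketch below uses Python's standard `statistics` module; the helper `summarize_rewards` and the reward values are hypothetical, not results from the run above.

```python
import statistics

def summarize_rewards(rewards):
    """Return the mean, sample standard deviation, min, and max of rewards."""
    return {
        'mean': statistics.mean(rewards),
        'stdev': statistics.stdev(rewards),  # needs at least two values
        'min': min(rewards),
        'max': max(rewards),
    }

# Made-up rewards for illustration only.
sample_rewards = [200.0, 200.0, 188.0, 200.0, 165.0]
summary = summarize_rewards(sample_rewards)
print(summary)
```

In the recipe, the same helper could be called as `summarize_rewards(total_rewards_eval)` once the evaluation loop has finished.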