I've been messing around with Q-learning adapted with NN, after I read these two articles: I'm not yet ready to understand and implement conv NN so I just fooled around with normal NN. I've been told to use sigmoid as activation function and cross-entropy as cost function. The problem is it doesn't seem to work well with Q-learning since I want my output to be a real number, using a probability output seem like a bad hack to me. The papers I read seem to use the quadratic cost function but I have no detail about the activation function. I checked the github of someone who implemented all these and he seem to not use any activation function at all.

Deep Learning (DL) models are revolutionizing the business and technology world with jaw-dropping performances in one application area after another -- image classification, object detection, object tracking, pose recognition, video analytics, synthetic picture generation -- just to name a few. However, they are like anything but classical Machine Learning (ML) algorithms/techniques. DL models use millions of parameters and create extremely complex and highly nonlinear internal representations of the images or datasets that are fed to these models. Whereas for the classical ML, domain experts and data scientists often have to write hand-crafted algorithms to extract and represent high-dimensional features from the raw data, deep learning models, on the other hand, automatically extracts and work on these complex features. A lot of theory and mathematical machines behind the classical ML (regression, support vector machines, etc.) were developed with linear models in mind.

It takes the inputs, multiplied by the weights for each neuron, and creates an output signal proportional to the input. In one sense, a linear function is better than a step function because it allows multiple outputs, not just yes and no. Not possible to use backpropagation (gradient descent) to train the model--the derivative of the function is a constant, and has no relation to the input, X. So it's not possible to go back and understand which weights in the input neurons can provide a better prediction. So a linear activation function turns the neural network into just one layer.

Pareek, Harsh H., Ravikumar, Pradeep K.

This paper presents a representation theory for permutation-valued functions, which in their general form can also be called listwise ranking functions. Pointwise ranking functions assign a score to each object independently, without taking into account the other objects under consideration; whereas listwise loss functions evaluate the set of scores assigned to all objects as a whole. In many supervised learning to rank tasks, it might be of interest to use listwise ranking functions instead; in particular, the Bayes Optimal ranking functions might themselves be listwise, especially if the loss function is listwise. A key caveat to using listwise ranking functions has been the lack of an appropriate representation theory for such functions. We show that a natural symmetricity assumption that we call exchangeability allows us to explicitly characterize the set of such exchangeable listwise ranking functions.

Lehnert, Lucas (Brown University, Providence, Rhode Island) | Laroche, Romain (Microsoft Maluuba, Montreal, QC) | Seijen, Harm van (Microsoft Maluuba, Montreal, QC)

In Reinforcement Learning, an intelligent agent has to make a sequence of decisions to accomplish a goal. If this sequence is long, then the agent has to plan over a long horizon. While learning the optimal policy and its value function is a well studied problem in Reinforcement Learning, this paper focuses on the structure of the optimal value function and how hard it is to represent the optimal value function. We show that the generalized Rademacher complexity of the hypothesis space of all optimal value functions is dependent on the planning horizon and independent of the state and action space size. Further, we present bounds on the action-gaps of action value functions and show that they can collapse if a long planning horizon is used. The theoretical results are verified empirically on randomly generated MDPs and on a grid-world fruit collection task using deep value function approximation. Our theoretical results highlight a connection between value function approximation and the Options framework and suggest that value functions should be decomposed along bottlenecks of the MDP's transition dynamics.