Radman Rakhshandehroo

Reinforcement Learning & AI

ICML 2025 Review

Paper Notes

Here are my cleaned-up notes on the papers (I have many more rough ones). Papers:

  1. Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models: Interesting paper on using VLAs for complex instructions, leveraging a hierarchy along with synthetic data. It begins with a user prompt, e.g. "make me a sandwich"; the high-level VLM, given three 2D images (base- and wrist-mounted cameras) and the user prompt, outputs a specific task, e.g. "get the bun", which is passed to the low-level VLA policy. The high-level VLM is trained on a robot data collection (an accumulation of open-source and other datasets) augmented with human annotations (e.g. "arm is picking up the KitKat"). Using the labelled data, a VLM is prompted to generate synthetic user instructions, mapping the image observation and user prompt into a robot verbal response and arm movement. There are ablation studies on different tasks, a flat VLA (no hierarchy), and training with and without the synthetic data and its impact on performance. Demos can be found on their website.
  2. Flow Q-Learning (FQL): A simple and performant offline reinforcement learning (RL) method that leverages an expressive flow-matching policy to model arbitrarily complex action distributions in data. Flow policies generate actions through multiple iterative steps, and backpropagating RL gradients through this iterative process is unstable. Instead of one policy, they use two separate policies with different roles. The flow policy learns complex, multimodal action distributions; it is trained via behavioural cloning on the offline data, which dodges the unstable-gradient issue. The one-step (student) policy is trained with RL plus one-step distillation from the flow policy, so it gets the RL rewards while staying close to the expressive policy. With regard to results, their FQL approach has a higher mean than other approaches. Demos of the experiments are here.
  3. SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation: Follow-up work on the Segment Anything 2 (SAM2) model, a computer-vision foundation model that segments objects in images. Using SAM2's object-detection features + multi-resolution upsampling (going from global context to texture details in stages) + a multi-view transformer, they created SAM2Act. Additionally, it can have a memory extension, called SAM2Act+, which has a FIFO queue to store observations, a memory encoder, and memory attention. They achieve an 87% success rate on 18 RLBench tasks, with a 4% drop in performance when working under perturbations in the environment, and a 94% success rate on MemoryBench tasks (introduced by the authors). It requires task-specific tuning of the memory window and is not extended to dexterous continuous control.
  4. VIP: Vision Instructed Pre-training for Robotic Manipulation: The authors suggest that vision instructions work better than text instructions for robots, due to limited training data, the domain gap (lack of image-text pairs), and ambiguity in text. Their VIP solution uses the observation, a future frame (what the scene should look like after the action, s_{t+1}), and sparse point flows (critical movement patterns between s_t and s_{t+1}). Pre-training uses the full information, then progressively masks the point flows during training so the robot learns to work with only the current and future images. During inference, since future frames are not available, the robot is given a cropped image of the target object instead of the future frame, alongside the current observation. It leverages VIRT (Vision Instructed Robotic Transformer), with transformer layers initialized from DINOv2. The loss objective focuses on automatically identifying which parts of a task require precision versus which parts naturally have variation. Result videos are here; I am very curious why the arm is somewhat jittery.
  5. Flow-based Domain Randomization for Learning and Sequencing Robotic Skills: This paper addresses domain randomization, where robots are trained to handle real-world variations by changing simulation parameters like friction, mass, and lighting. The key issue is how much domain randomization should be done: too little is meaningless, and too much makes the environments unsolvable. The authors introduce GoFlow, a novel approach to learned domain randomization that combines actor-critic reinforcement learning with a neural sampling distribution to learn robust policies that generalize to real-world settings. The goal is to maximize: expected reward + α · entropy − β · stability penalty. Results show that more robust policies are achieved across different simulation environments.
  6. Latent Diffusion Planning for Imitation Learning: The authors introduce Latent Diffusion Planning (LDP), a new approach to imitation learning that can use multiple types of training data beyond expert demonstrations. Researchers often have access to action-free data, such as videos of tasks being performed without recorded actions, or suboptimal data, like failed attempts or imperfect actions. LDP incorporates these by dividing the control problem into two parts: (1) planning, which predicts the sequence of future states, and (2) action extraction, which determines the actions that must be taken to reach those states. This separation allows action-free data to be used by the planner. Furthermore, the inverse dynamics model can use the suboptimal data, since its state-to-action mapping is general (I am confused by how performance does not deteriorate here). LDP uses a VAE as a visual encoder to derive latent representations and uses diffusion models to forecast sequences of future latent states. Finally, LDP's inverse dynamics model uses diffusion models to extract actions from consecutive latent states. At inference time, the robot executes the actions output by the inverse dynamics model using receding-horizon control. Results show that LDP outperforms baselines when additional data is available, and action-free data improves planning capabilities. An ablation study compares a flat LDP with a hierarchical version; dense temporal forecasting performs better than hierarchical approaches.
  7. Reward-free World Models for Online Imitation Learning: This paper introduces IQ-MPC (Inverse soft-Q learning for Model Predictive Control), which combines online imitation learning with world models to learn complex tasks from expert demonstrations. Traditional imitation learning has the following challenges: (1) offline methods struggle when the robot stumbles upon a state it has not seen before; (2) online methods that use adversarial training (like GAIL) can be unstable in high-dimensional environments; (3) existing approaches struggle with complex tasks that involve rich observations, like vision, and intricate dynamics. Unlike traditional models that need reward signals, IQ-MPC: (1) models environment dynamics in a latent space; (2) extracts rewards directly from Q values instead of learning a separate reward model; (3) uses planning for control instead of direct policy execution. The model learns from a combination of expert demonstrations and behavioural data collected during training; no reward model is needed, and rewards are decoded from the Q values instead. At inference, model predictive path integral (MPPI) control is used for planning: sampled action sequences are evaluated using the learned dynamics, and their respective rewards are decoded to choose the best action. Results show that the method outperforms baselines on locomotion tasks (DMControl), manipulation tasks (MyoSuite), and visual tasks. They also show theoretically that the learning objective minimizes a bound on the performance difference between the current policy and the expert policy.
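
To make the Hi Robot hierarchy (paper 1) concrete, here is a toy sketch of the two-level control loop. The function names and return values are my own stand-ins, not the paper's API:

```python
# Toy sketch of the hierarchical VLM -> VLA control loop (paper 1).
# Both "models" below are hypothetical stand-ins for illustration only.

def high_level_vlm(images, user_prompt):
    # Stand-in: maps camera images + an open-ended prompt to an atomic subtask.
    if "sandwich" in user_prompt:
        return "pick up the bun"
    return "wait"

def low_level_vla(images, subtask):
    # Stand-in: maps images + subtask text to a short chunk of arm actions.
    return [("move_arm", 0.1), ("close_gripper", 1.0)]

def control_step(images, user_prompt):
    subtask = high_level_vlm(images, user_prompt)
    actions = low_level_vla(images, subtask)
    return subtask, actions

# Three images: base camera plus two wrist-mounted cameras, as in the paper.
subtask, actions = control_step(images=[None, None, None],
                                user_prompt="make me a sandwich")
```

The point of the split is that the high level reasons in language while the low level only ever sees one concrete subtask at a time.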
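
For FQL (paper 2), the one-step policy's training signal can be sketched on toy scalars: maximize Q while paying a distillation penalty for straying from the flow policy's action. The weighting `alpha` and all numbers are illustrative, not the paper's exact formulation:

```python
# Toy sketch of FQL's one-step policy loss (paper 2): an RL term plus a
# distillation term toward the BC-trained flow policy, so no gradients need
# to flow back through the iterative flow-sampling process.

def one_step_policy_loss(q_value, one_step_action, flow_action, alpha=1.0):
    distill = (one_step_action - flow_action) ** 2  # stay near the flow policy
    return -q_value + alpha * distill               # minimize = maximize Q

loss = one_step_policy_loss(q_value=2.0, one_step_action=0.5, flow_action=0.7)
```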
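
The SAM2Act+ FIFO observation memory (paper 3) is essentially a bounded queue; the window size is the task-specific hyperparameter the note says needs tuning. A minimal sketch with a made-up window of 3:

```python
from collections import deque

# Toy sketch of SAM2Act+'s FIFO observation memory (paper 3). The window
# size (maxlen) is the task-specific knob; 3 is an arbitrary choice here.
memory = deque(maxlen=3)
for t in range(5):
    memory.append(f"obs_{t}")  # once full, the oldest observation is evicted

# After 5 steps, only the 3 most recent observations remain in the window.
```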
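
VIP's progressive masking of point flows (paper 4) could look roughly like the following: drop flows with a probability that ramps up over training, so the policy learns to rely on the current and future frames alone. The linear schedule and names are my guesses, not the paper's:

```python
import random

# Toy sketch of VIP's progressive masking idea (paper 4): point flows are
# dropped with probability p_mask, which ramps from 0 to 1 over training.
# The linear schedule is an assumption for illustration.

def mask_point_flows(flows, step, total_steps, rng):
    p_mask = min(1.0, step / total_steps)
    return [f for f in flows if rng.random() >= p_mask]

rng = random.Random(0)
flows = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]
early = mask_point_flows(flows, step=0, total_steps=100, rng=rng)    # keeps all
late = mask_point_flows(flows, step=100, total_steps=100, rng=rng)   # drops all
```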
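
The GoFlow objective (paper 5) as stated above can be written out on toy numbers; the uniform distribution, penalty value, and coefficients below are all illustrative:

```python
import math

# Toy version of the stated GoFlow objective (paper 5):
#   expected reward + alpha * entropy of the randomization distribution
#                   - beta * a stability penalty.
# All inputs and coefficients are made-up illustrative values.

def goflow_objective(expected_reward, probs, stability_penalty,
                     alpha=0.1, beta=0.5):
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return expected_reward + alpha * entropy - beta * stability_penalty

score = goflow_objective(expected_reward=1.0,
                         probs=[0.25, 0.25, 0.25, 0.25],  # uniform over 4 envs
                         stability_penalty=0.2)
```

The entropy term pushes toward wider randomization; the stability term pushes back, which is exactly the "not too little, not too much" tension.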
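
LDP's inference split (paper 6) can be sketched as: plan future latents, extract an action from consecutive latents via inverse dynamics, execute only the first action (receding horizon), then replan. The scalar "latent dynamics" below is made up purely for the demo:

```python
# Toy sketch of LDP inference (paper 6): planner -> inverse dynamics ->
# receding-horizon execution. The 0.9-decay "dynamics" is a stand-in for
# the latent diffusion planner, not anything from the paper.

def plan_latents(z_t, horizon):
    latents = [z_t]
    for _ in range(horizon):
        latents.append(latents[-1] * 0.9)  # stand-in for the learned planner
    return latents

def inverse_dynamics(z_a, z_b):
    # Stand-in: the action is the latent delta the robot should realize.
    return z_b - z_a

def receding_horizon_step(z_t, horizon=4):
    latents = plan_latents(z_t, horizon)
    # Execute only the first planned transition, then replan next step.
    return inverse_dynamics(latents[0], latents[1])

action = receding_horizon_step(z_t=1.0)
```

Since only the planner consumes state sequences, action-free video data can train it without ever touching the inverse dynamics model.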
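
The "rewards decoded from Q values" step in IQ-MPC (paper 7) follows from the Bellman identity: the implied one-step reward is r(s, a) = Q(s, a) − γ·V(s′). A one-line sketch with toy values:

```python
# Toy sketch of IQ-MPC's reward decoding (paper 7): under the Bellman
# identity, the implied reward is r(s, a) = Q(s, a) - gamma * V(s').
# The Q and V values here are arbitrary toy numbers.

def decode_reward(q_sa, v_next, gamma=0.99):
    return q_sa - gamma * v_next

r = decode_reward(q_sa=5.0, v_next=4.0)  # implied one-step reward
```

This is what lets MPPI score sampled action sequences without ever training a separate reward model.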

Works to explore more

Some other interesting works:

Events

I personally really enjoyed the invited talks of the workshops and think you might too once they are opened up. The talks I liked were the ones given by Hao Su (Computer graphics and robotics), Masatoshi Uehara (reward guided generation in diffusion models), Pratyush Maini (model memorization research), Wenhao Yu (Robotics and Gemini-ER), Jiajun Wu (physical scene understanding) and Sergey Levine (Exploration with prior knowledge).

P.S. Hope there aren’t too many typos, I think I caught most.