RP1M: Robot Piano-Playing 1 Million Dataset

A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands


CoRL 2024

*Equal contribution    ¹Max Planck Institute for Intelligent Systems    ²Aalto University    ³University of Southern California    ⁴University of Oulu

Our RP1M dataset is the first large-scale dataset of dynamic, bi-manual manipulation with dexterous robot hands.
It contains ~1M expert trajectories of bi-manual robot piano playing, covering ~2k musical pieces.

The expert trajectories in RP1M are collected by training an RL agent for each of the ~2k songs and rolling out each policy 500 times with different random seeds. Our method does not require any human demonstrations or fingering annotations: given only the MIDI file, the fingering is discovered automatically while playing the piano. A rough sketch of this collection loop is shown below.
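The sketch below illustrates the rollout-collection step under stated assumptions: it rolls out a trained, song-specific policy with different random seeds and records the state-action trajectories. The names `make_piano_env` and the gym-style environment/policy interfaces are hypothetical placeholders for illustration, not the actual RP1M pipeline.

```python
def collect_rollouts(midi_path, policy, make_piano_env, num_rollouts=500):
    """Roll out one song-specific trained policy with different random seeds
    and store the resulting (observation, action) trajectories.

    `make_piano_env` and `policy` are assumed, gym-style interfaces:
    env.reset() -> obs, env.step(action) -> (obs, reward, done, info),
    policy(obs) -> action.
    """
    trajectories = []
    for seed in range(num_rollouts):
        env = make_piano_env(midi_path, seed=seed)  # one env instance per seed
        obs, done = env.reset(), False
        episode = {"observations": [], "actions": []}
        while not done:
            action = policy(obs)
            episode["observations"].append(obs)
            episode["actions"].append(action)
            obs, reward, done, info = env.step(action)
        trajectories.append(episode)
    return trajectories
```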

How does the agent learn to play the piano given only the MIDI file?

Although fingering is highly personalized, generally speaking, it helps pianists press keys in a timely and efficient manner. Motivated by this, in addition to maximizing the key-pressing reward, we also aim to minimize the moving distances of the fingers. Specifically, at time step \(t\), for the \(i\)-th key \(k^i\) to press, we use the \(j\)-th finger \(f^j\) to press this key such that the overall moving cost is minimized. We define the minimized cumulative moving distance as \(d_t^{\text{OT}} \in \mathbb{R}^+\), which is given by:

\[ \begin{split} d_t^{\text{OT}} = \min_{w_t} & \sum_{(i,j)\in K_t \times F} w_t(k^i, f^j)\cdot c_t(k^i, f^j), \\ \text{s.t.} ~~ & i)~ \sum_{j\in F}w_t(k^i, f^j) = 1, ~~ \text{for} ~~ i \in K_t, \\ & ii)~ \sum_{i\in K_t}w_t(k^i, f^j) \leq 1, ~~ \text{for} ~~ j \in F, \\ & iii)~ w_t(k^i, f^j) \in \{0, 1 \}, ~~ \text{for} ~~ (i, j) \in K_t \times F. \end{split} \]

Here, \(K_t\) represents the set of keys to press at time step \(t\) and \(F\) represents the fingers of the robot hands. \(c_t(k^i, f^j)\) is the cost of moving finger \(f^j\) to piano key \(k^i\) at time step \(t\), computed as their Euclidean distance. \(w_t(k^i, f^j)\) is a boolean weight; the constraints enforce that each key in \(K_t\) is pressed by exactly one finger in \(F\), and that each finger presses at most one key. The constrained optimization problem above is an optimal transport problem. Intuitively, it finds the best "transport" strategy such that the overall cost of moving (a subset of) fingers \(F\) to keys \(K_t\) is minimized.
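With binary weights and these constraints, the minimization is a standard linear assignment problem. As a minimal sketch, the code below computes \(d_t^{\text{OT}}\) from 3D key and fingertip positions using the Hungarian algorithm (scipy.optimize.linear_sum_assignment); the function name and position inputs are assumptions for illustration rather than the exact RP1M implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def fingering_cost(key_positions, finger_positions):
    """Return d_t^OT: the minimal total Euclidean distance of moving a distinct
    finger onto every key that must be pressed at this time step.

    key_positions:    (|K_t|, 3) positions of the keys to press.
    finger_positions: (|F|, 3)   positions of the robot fingertips, |F| >= |K_t|.
    """
    # Cost matrix c_t(k^i, f^j): Euclidean distance between key i and finger j.
    cost = np.linalg.norm(
        key_positions[:, None, :] - finger_positions[None, :, :], axis=-1
    )
    # Hungarian algorithm on the rectangular cost matrix: every key (row) is
    # matched to a distinct finger (column), so constraints i)-iii) hold.
    key_idx, finger_idx = linear_sum_assignment(cost)
    return cost[key_idx, finger_idx].sum()


# Example: two keys to press, ten fingertips at arbitrary positions.
keys = np.array([[0.10, 0.0, 0.0], [0.24, 0.0, 0.0]])
fingers = np.random.uniform(-0.1, 0.4, size=(10, 3))
print("d_t^OT =", fingering_cost(keys, fingers))
```

The negative of this quantity can then serve as the finger-movement penalty that is combined with the key-pressing reward during RL training.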

The fingering discovered by the agent itself vs human pianist annotations

We compare the fingering discovered by the agent with human annotations. We visualize a sample trajectory of playing French Suite No. 5 Sarabande and the corresponding fingering. The agent discovers different fingering than humans do. For example, for the right hand, humans mainly use the middle and ring fingers, while our agent uses the thumb and index finger. Furthermore, in some cases, human annotations are not suitable for the robot hand due to its different morphology. For example, at the second time step in the figure, the human uses the index finger and ring finger. Due to the mechanical limitations of the robot hand, however, it cannot press keys that far apart with these two fingers, so mimicking the human fingering would miss one key. Instead, our agent discovers that using the thumb and little finger satisfies the hardware limitation and accurately presses the target keys.

Performance

The performance of our method is on par with the approach that requires human-annotated fingering, and surpasses the one without fingering information by a large margin.

Highly dynamic hand motions
Long-distance hand movements

Cross Embodiment

Since our approach does not require human demonstrations or fingering annotations, it can be easily transferred to robot hands with different morphologies, and even to other robotic platforms.

5-fingered hands
4-fingered hands

More Results