Two-Stream Network for Sign Language Recognition and Translation

A Loss Formulation

Neural Information Processing Systems 

T/4 is the length of the output.

Figure 1: Illustration of keypoints used in our approach.

Data augmentations include spatial cropping in the range of [0.7, 1.0] and frame-rate augmentation. We adopt identical data augmentations for RGB videos and heatmap sequences to maintain spatial and temporal consistency. We drop the sign pyramid networks in the inference stage. A CTC decoder is adopted to yield the final gloss predictions.
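The requirement that RGB videos and heatmap sequences receive identical augmentations can be illustrated with a minimal NumPy sketch: the random parameters are sampled once and then applied to both streams. This is not the authors' implementation. The function names, the `(T, H, W, C)` array layout, and the `rate` parameter are hypothetical, the excerpt gives no range for the frame-rate augmentation, and reading [0.7, 1.0] as a fraction of each side length (rather than of the area) is an assumption.

```python
import numpy as np


def sample_shared_params(num_frames, rng, scale_range=(0.7, 1.0), rate=1.0):
    """Sample one set of augmentation parameters to reuse for every stream.

    Assumption: scale_range is a fraction of each side length.
    `rate` is a hypothetical temporal resampling factor chosen by the caller.
    """
    scale = rng.uniform(*scale_range)
    # Crop origin as a fraction of the frame, so that it transfers across
    # streams with different spatial resolutions.
    top = rng.uniform(0.0, 1.0 - scale)
    left = rng.uniform(0.0, 1.0 - scale)
    # Frame-rate augmentation as uniform temporal resampling.
    new_len = max(1, int(round(num_frames * rate)))
    frame_idx = np.linspace(0, num_frames - 1, new_len).round().astype(int)
    return scale, top, left, frame_idx


def apply_params(clip, params):
    """Apply previously sampled parameters to a clip of shape (T, H, W, C)."""
    scale, top, left, frame_idx = params
    _, height, width, _ = clip.shape
    crop_h = max(1, int(round(height * scale)))
    crop_w = max(1, int(round(width * scale)))
    # Clamp the origin so that rounding never pushes the crop out of bounds.
    y0 = min(int(round(top * height)), height - crop_h)
    x0 = min(int(round(left * width)), width - crop_w)
    return clip[frame_idx, y0:y0 + crop_h, x0:x0 + crop_w]
```

Calling `sample_shared_params` once per training sample and passing the result to `apply_params` for both the RGB clip and the heatmap clip keeps the two streams aligned in space and time, even when their resolutions differ.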
