About Me

I am Shentong Mo (莫申童) from Beijing, China.
I am currently a master's student in Electrical and Computer Engineering at CMU.
Through my past research projects I have explored a variety of topics, including head pose estimation and forecasting, action localization, object detection, and face recognition.
I have worked closely with Prof. Miao Xin of the Chinese Academy of Sciences, Prof. Wee Peng Tay of Nanyang Technological University, and Prof. Qiang Zou of Tianjin University, and I learned a great deal from these awesome advisors.

Education

Carnegie Mellon University

Master of Science in Electrical and Computer Engineering, May 2020

Tianjin University

Master of Engineering in Electronics, September 2016

Tianjin University

Bachelor of Engineering in Electrical and Electronics, September 2012

Projects

Spatio-Temporal Action Localization

Project, August 2020 - Present

YOWOv2: We aim to improve localization recall and classification accuracy by building on You Only Watch Once (YOWO), the only end-to-end architecture for spatio-temporal action localization.
Datasets: We propose to use the public Joint-annotated Human Motion Database (J-HMDB-21) and a private dataset of restaurant videos generously provided by the CMU-based startup Agot.AI.
Long-term Feature Banks: We will investigate how to combine long-term context with short-term information to improve action recognition performance.
Evaluation Metrics: Following the YOWO paper, we plan to use the two standard metrics in spatio-temporal action recognition, frame-level mean Average Precision (frame-mAP) and video-mAP; a minimal frame-mAP sketch follows this list.
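
As a concrete reference for the metric, here is a minimal Python sketch of frame-mAP for a single action class: detections are sorted by confidence, greedily matched to ground-truth boxes at an IoU threshold, and average precision is approximated as the area under the precision-recall curve. The function names and data layout are illustrative assumptions; official evaluation protocols differ in interpolation details.

```python
import numpy as np

def box_iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def frame_ap(detections, ground_truths, iou_thresh=0.5):
    """Frame-level AP for one action class.

    detections: list of (frame_id, score, box);
    ground_truths: dict frame_id -> list of boxes.
    Each ground-truth box may be matched at most once.
    """
    detections = sorted(detections, key=lambda d: -d[1])
    matched = {f: [False] * len(b) for f, b in ground_truths.items()}
    n_gt = sum(len(b) for b in ground_truths.values())
    tp = np.zeros(len(detections))
    fp = np.zeros(len(detections))
    for i, (frame, _, box) in enumerate(detections):
        gts = ground_truths.get(frame, [])
        ious = [box_iou(box, g) for g in gts]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thresh and not matched[frame][best]:
            tp[i] = 1
            matched[frame][best] = True
        else:
            fp[i] = 1
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # Approximate AP as the area under the precision-recall curve;
    # frame-mAP is this value averaged over all action classes.
    return float(np.trapz(precision, recall))
```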

Deep Variational Disentangled Audio-Visual Embeddings under Unsupervised Learning

Project, August 2020 - Present

Cross VAE: We propose a cross-modal sequence VAE that disentangles both audio and image clips into time-invariant (identity) and time-variant (content) latent embeddings under unsupervised learning; a single-branch sketch appears after this list.
Datasets: We will use the VoxCeleb2 dataset for training. VoxCeleb2 contains over 1 million utterances from 6,112 celebrities, extracted from videos uploaded to YouTube.
Evaluation Tasks: We will evaluate our approach on three popular speech tasks: speaker recognition, automatic speech recognition (ASR), and voice conversion. All of these tasks will be trained on VoxCeleb2 and then evaluated on VoxCeleb1.
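
To make the disentangling idea concrete, below is a minimal PyTorch sketch of one modality branch of such a sequence VAE: a GRU encodes the input features, the temporally pooled state parameterizes a single time-invariant (identity) latent, the per-step states parameterize time-variant (content) latents, and both are sampled with the reparameterization trick. This is an illustrative simplification under assumed dimensions, not the proposed Cross VAE itself (which additionally crosses the audio and visual branches).

```python
import torch
import torch.nn as nn

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I).
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

class DisentangledSeqVAE(nn.Module):
    """One modality branch of a disentangled sequence VAE (sketch).

    A GRU encodes a feature sequence; the time-averaged state yields a
    single time-invariant (identity) latent, while per-step states yield
    time-variant (content) latents. All dimensions are illustrative.
    """
    def __init__(self, feat_dim=80, hidden=256, z_id=64, z_ct=32):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.id_head = nn.Linear(hidden, 2 * z_id)  # -> (mu, logvar) identity
        self.ct_head = nn.Linear(hidden, 2 * z_ct)  # -> (mu, logvar) content
        self.decoder = nn.Linear(z_id + z_ct, feat_dim)

    def forward(self, x):                  # x: (batch, time, feat_dim)
        h, _ = self.encoder(x)             # h: (batch, time, hidden)
        id_mu, id_lv = self.id_head(h.mean(dim=1)).chunk(2, dim=-1)
        ct_mu, ct_lv = self.ct_head(h).chunk(2, dim=-1)
        z_id = reparameterize(id_mu, id_lv)            # (batch, z_id)
        z_ct = reparameterize(ct_mu, ct_lv)            # (batch, time, z_ct)
        z_id_seq = z_id.unsqueeze(1).expand(-1, x.size(1), -1)
        recon = self.decoder(torch.cat([z_id_seq, z_ct], dim=-1))
        kl_id = -0.5 * torch.sum(1 + id_lv - id_mu.pow(2) - id_lv.exp())
        kl_ct = -0.5 * torch.sum(1 + ct_lv - ct_mu.pow(2) - ct_lv.exp())
        return recon, kl_id + kl_ct        # reconstruction + KL terms
```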

Long-term Head Pose Forecasting Conditioned on the Gaze Prior

Computer Vision Algorithm Engineer, March 2020 - August 2020

CVGAE: We propose the conditional variational graph autoencoder (CVGAE), a deep conditional generative model for self-supervised learning on graph-structured output prediction using Gaussian latent variables (a generic sketch appears after this list).
One-to-many Mappings: CVGAE is capable of learning restricted one-to-many mappings conditioned on the graph-structured input data.
Gaze Prior: We introduce a gaze prior as the condition for the proposed CVGAE on long-term head pose forecasting problems.
Experiment Results: Experiments demonstrate the effectiveness of our proposed model and the importance of the gaze prior condition for this task. We achieve superior long-term head pose forecasting performance on the BIWI dataset.
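
For intuition, the sketch below shows a generic conditional VGAE in PyTorch: a per-node condition (here standing in for the gaze prior) is concatenated to the node features before a graph-convolution encoder produces Gaussian parameters, the latent is sampled via reparameterization, and an inner-product decoder scores the adjacency. The layer sizes and single-layer encoder are assumptions for illustration, not the exact CVGAE architecture from the paper.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, a_hat, h):           # a_hat: normalized adjacency (N, N)
        return torch.relu(a_hat @ self.lin(h))

class ConditionalVGAE(nn.Module):
    """Generic conditional VGAE sketch; layer sizes are illustrative."""
    def __init__(self, feat_dim, cond_dim, hidden=64, z_dim=16):
        super().__init__()
        self.gc1 = GCNLayer(feat_dim + cond_dim, hidden)
        self.mu_head = nn.Linear(hidden, z_dim)
        self.logvar_head = nn.Linear(hidden, z_dim)

    def forward(self, a_hat, x, c):        # x: node features, c: condition
        # Concatenating the condition (e.g., a gaze-prior feature per node)
        # makes the Gaussian latent conditional on it.
        h = self.gc1(a_hat, torch.cat([x, c], dim=-1))
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparam.
        adj_logits = z @ z.t()             # inner-product decoder
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return adj_logits, kl
```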

Distributed Face Recognition Algorithms for Multi-camera Networks

Research Assistant, August 2018 - November 2018

Data Collection: We collect detected faces of lab members to build a training set and a face-feature library for MobileFaceNet verification.
Model Training: We fine-tune MobileFaceNet from pretrained weights on our training set, achieving 99.4% AP on the LFW dataset and 78.6% AP on our test set.
Multi-camera: We deploy MobileFaceNet on Raspberry Pi devices connected to multiple cameras for real-time face recognition and measure the pipeline's inference latency.
Pipeline: We use Multi-task Cascaded Convolutional Networks (MTCNN) for face detection, apply MobileNet/MobileFaceNet to extract face features, and obtain recognition results by computing the cosine similarity between these features and those in our feature library; a minimal matching sketch follows this list.
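
The final matching step reduces to a nearest-neighbor search under cosine similarity. Below is a minimal NumPy sketch, assuming the library stores one L2-normalized embedding per identity; the acceptance threshold is illustrative and would be tuned on validation data.

```python
import numpy as np

def identify(query_emb, library, threshold=0.5):
    """Match one face embedding against a feature library.

    library: dict mapping identity name -> L2-normalized embedding.
    Returns (name, similarity), or ("unknown", similarity) if the best
    cosine similarity falls below the (illustrative) threshold.
    """
    q = query_emb / (np.linalg.norm(query_emb) + 1e-9)
    best_name, best_sim = "unknown", -1.0
    for name, ref in library.items():
        sim = float(q @ ref)   # cosine similarity: both vectors unit-norm
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim < threshold:
        return "unknown", best_sim
    return best_name, best_sim
```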

Algorithms for Real-time Detection of Intestinal Polyps

Research Assistant, March 2017 - July 2019

Data Collection and Labeling: We collect around 10,000 polyp images from four collaborating hospitals and complete the labeling with the help of experts from these hospitals.
Data Augmentation: We apply several data augmentation techniques, such as horizontal flipping, cropping, padding, and rotation, to the original data, growing our training set to 60,000 images (see the sketch after this list).
Model Training: We use Detectron2 and Darknet to train Faster R-CNN and YOLOv2 models, respectively, achieving 80.5% mAP and 78.4% mAP on our 10,000-image test set.
CPDS: We deploy the trained models in a real-time intestinal polyp recognition system and perform online learning as new polyp images are added, yielding a computer-aided polyp diagnosis system (CPDS).
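
As an illustration of the augmentation step, here is a minimal torchvision sketch. The exact crop size, padding, and rotation range used in the project are not stated, so the values below are assumptions; for detector training, the bounding-box coordinates must also be transformed consistently with the image.

```python
import torchvision.transforms as T

# Image-level augmentations named above; parameter values are illustrative.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),   # horizontal flipping
    T.Pad(16),                       # padding around the frame
    T.RandomCrop(448),               # random cropping back to a fixed size
    T.RandomRotation(degrees=15),    # small random rotations
])
# Generating several augmented copies per image grows ~10,000 originals
# toward the 60,000-image training set described above.
```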

Skills

DL Frameworks: PyTorch, TensorFlow, Keras, and MegEngine (contributor)
Utilities: MATLAB, Octave, and LaTeX
OS Scripting: Linux shell scripts and AppleScript
Version Control: Git commands and GitFlow

Programming Languages: Python, C/C++, Java, C#, CSS, and HTML5

Conference

CVGAE: long-term head pose forecasting conditioned on the gaze prior

Shentong Mo, Miao Xin. 2020. To be submitted.

In unsupervised learning, the variational graph autoencoder (VGAE) has shown clear advantages in learning latent representations of graph-structured data and in generating diverse predictions through structured one-to-many mapping. However, most of the generated results stray far from reality due to the lack of a prior restriction. In this paper, we propose the conditional variational graph autoencoder (CVGAE), a deep conditional generative model for self-supervised learning on graph-structured output prediction using Gaussian latent variables. This model is capable of learning restricted one-to-many mappings conditioned on the graph-structured input data. Furthermore, we introduce a gaze prior as the condition for the proposed CVGAE on long-term head pose forecasting problems. Experiments demonstrate the effectiveness of our proposed model and the importance of the gaze prior condition for this task. We achieve superior long-term head pose forecasting performance on the BIWI dataset compared to most existing methods.