SegFuse: Dynamic Driving Scene Segmentation

SegFuse is a semantic video scene segmentation competition that aims to find the best ways to utilize temporal information to improve the perception of driving scenes. Given video of front-facing driving scenes with corresponding driving state data, can you fuse these different kinds of information together to build a state-of-the-art dynamic perception model?

(Segmentation video coming soon)


In this competition, competitors are given three long, untrimmed videos of driving scenes, along with various driving state data aligned by timestamp at each frame. We provide dense, per-frame pixel annotations for two of the videos, serving as the ground truth for training and validation. For testing, competitors upload a YouTube video showing their predictions on the driving scenes. The competition is open to the world, and the dataset will be publicly available for academic and research use. As part of MIT 6.S094: Deep Learning for Self-Driving Cars, instructions and starter code will also be available on GitHub.


To enter the competition, first register an account on the site if you haven’t already. After logging in, click here to download the dataset and instructions (to be released soon). Competitors are free to use any machine learning / deep learning methods, with or without external data. After training your model, run it on the testing videos to get predictions, then run the provided code to generate a submission video. Finally, upload your video to YouTube and submit the link, get your score, and try to make it onto the Leaderboard!

For people new to this topic, we provide simple starter code in Python + Keras that builds a convolutional neural network with an encoder-decoder structure. We also provide pre-computed state-of-the-art per-frame segmentations (by ResNet-DUC [3] pre-trained on Cityscapes [1]) for those who want to start quickly and focus on modeling temporal information. Once you have enough experience, use your own method to generate predictions on the testing set.
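
To make the encoder-decoder idea concrete, here is a toy NumPy forward pass illustrating the shape flow such a model follows; the layer sizes and weight names are illustrative only and are not the actual starter code:

```python
import numpy as np

def avg_pool2(x):
    """Encoder step: 2x2 average pooling on an (H, W, C) tensor."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample2(x):
    """Decoder step: 2x nearest-neighbor upsampling."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def toy_encoder_decoder(image, w_enc, w_dec):
    """image: (H, W, 3); w_enc: (3, C); w_dec: (C, num_classes).

    The encoder shrinks spatial resolution while mixing channels; the
    decoder upsamples back to full resolution and outputs per-pixel
    class scores, which argmax turns into a label map.
    """
    feats = np.maximum(avg_pool2(image @ w_enc), 0.0)   # encode + ReLU
    scores = upsample2(feats) @ w_dec                   # decode to class scores
    return scores.argmax(axis=-1)                       # per-pixel labels
```

The real starter model replaces the fixed pooling and matrix products with learned convolutions, but the resolution-down-then-up pattern is the same.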


The MIT-SegFuse Dataset aims at understanding dynamic driving scenes. We provide a long, untrimmed video (6:35, 11869 frames) at 720p (1280×720), 30 fps, capturing a single daytime trip around MIT, with fine, per-frame, pixel-wise semantic annotation. The video is split into three parts: 3:20 (6000 frames) for training, 0:40 (1200 frames) for validation, and the remaining 2:35 (4669 frames) for testing. Driving state data, including vehicle acceleration and steering angle, are also provided for each frame, aligned by timestamp. To take advantage of past research, we follow the same labeling policy and class definitions as the Cityscapes [1] dataset. In addition, we provide pre-computed optical flow (by FlowNet 2.0 [4] pre-trained on KITTI [2]) and pre-computed per-frame segmentation (by ResNet-DUC [3] pre-trained on Cityscapes [1]). The latter also serves as the baseline for this competition.
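
One common way to use the provided optical flow is to propagate a frame's segmentation forward in time. Below is a minimal NumPy sketch of backward warping with nearest-neighbor sampling; the function name and the flow-direction convention are assumptions for illustration, not part of the released tooling:

```python
import numpy as np

def warp_labels(prev_labels, flow):
    """Propagate a label map to the next frame by backward warping.

    prev_labels: (H, W) integer class ids for frame t-1.
    flow: (H, W, 2) per-pixel displacement (dx, dy) mapping each pixel
          of frame t back to its source location in frame t-1.
    Uses nearest-neighbor sampling; out-of-bounds lookups clamp to the border.
    """
    h, w = prev_labels.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_labels[src_y, src_x]
```

Nearest-neighbor sampling is used (rather than bilinear) because class ids are categorical and must not be averaged.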

(The dataset will be released shortly)


The submission link and leaderboard will be online soon; please stay tuned. Register an account to receive news and notifications.


We use the standard Jaccard index, commonly known as the PASCAL VOC intersection-over-union metric, IoU = TP/(TP+FP+FN) [5], where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively, determined over the whole test set. The final score is the mean over all classes.
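
The metric above can be computed directly from predicted and ground-truth label maps. A minimal NumPy sketch (the handling of classes absent from both maps is an assumption here; check the official evaluation code when released):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Jaccard index over classes, IoU_c = TP / (TP + FP + FN)."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom == 0:          # class absent from both prediction and ground truth
            continue            # skip it rather than count it as 0 or 1
        ious.append(tp / denom)
    return float(np.mean(ious))
```

Note that the counts are accumulated over the whole test set, not averaged per image, so in practice TP/FP/FN should be summed across all frames before dividing.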


Enabling machines to understand the whole driving scene is crucial to autonomous driving. Recently, there has been great progress on semantic image segmentation of driving scenes, with large-scale public datasets such as Cityscapes [1], Mapillary [7], KITTI [2], and BDD [6]. However, most current work focuses on static image segmentation, which does not utilize the rich temporal information shared among consecutive frames. The CamVid [8] dataset provides annotated video sequences, but it is relatively small and lacks corresponding driving state data. We therefore introduce a novel semantic video segmentation challenge: SegFuse. The goal is to push driving scene perception forward from the image level to the video level. Outside the driving scenario, DAVIS [9] is a good source for video segmentation and propagation, and is similar in spirit to this work.

It is easy to notice that neighboring frames usually exhibit very high visual similarity. Such redundancy is not ideal for model training, especially for deep neural networks. As a result, most other scene datasets favor diverse scene images for robust perception. We recommend that competitors use external data or start from pre-trained models.


Below are a few open research questions for which people may find this dataset helpful. Ideas are welcome; please let us know if you find other interesting topics and would like to share them with the world.

• Spatio-temporal Semantic Segmentation

For the main competition, we are interested in finding novel ways to utilize temporal data, such as optical flow and driving state, to improve perception beyond what static images alone allow.
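
As a very simple baseline for such fusion, consider a per-pixel majority vote over a short window of per-frame segmentations that have already been aligned (e.g. warped by optical flow). A NumPy sketch, with names chosen for illustration:

```python
import numpy as np

def temporal_vote(seg_window, num_classes):
    """Fuse per-frame segmentations by per-pixel majority vote.

    seg_window: (T, H, W) integer label maps for T consecutive frames,
                assumed already aligned to the current frame.
    Returns the (H, W) label map of the most frequent class per pixel.
    """
    t, h, w = seg_window.shape
    # Count votes per class at each pixel, then take the argmax.
    votes = np.zeros((num_classes, h, w), dtype=np.int64)
    for c in range(num_classes):
        votes[c] = np.sum(seg_window == c, axis=0)
    return votes.argmax(axis=0)
```

Majority voting suppresses single-frame flicker; stronger approaches would weight votes by flow confidence or learn the fusion end to end.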

• Predictive Modeling

Can we know ahead of time what is going to happen on the road? Predictive power can be crucial to the safety of autonomous driving. Since we provide a dataset that is continuous in time, it can certainly be used for predictive perception research.

• Transfer Learning

How much extra data do we need if we have a perception system trained in Europe and want to use it in Boston? In practice, it is difficult to train an entire deep neural network from scratch, because datasets of sufficient size are relatively rare. Transfer learning is key to training a deep neural network with limited data.

• Deep Learning with Video Encoding

Most current deep learning systems operate on RGB-encoded images. However, preserving exact RGB values for every single frame of a video is expensive in both computation and storage. Can you push deep learning toward the video level?

• Solving Redundancy of Video Frames

This question overlaps with spatio-temporal modeling: how can we efficiently extract useful data from highly similar frames? And what is the best frame rate for a good perception system? Efficiency is one of the most important requirements for real-time applications.
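
One simple way to tackle this redundancy is greedy keyframe selection: keep a frame only when it differs enough from the last kept frame. A NumPy sketch using mean absolute pixel difference as the (assumed, illustrative) change measure:

```python
import numpy as np

def select_keyframes(frames, threshold):
    """Greedy keyframe selection by mean absolute pixel difference.

    frames: (N, H, W) grayscale frames as floats.
    threshold: minimum mean absolute difference from the last kept frame.
    Returns the indices of kept frames; frame 0 is always kept.
    """
    keep = [0]
    last = frames[0]
    for i in range(1, len(frames)):
        if np.mean(np.abs(frames[i] - last)) > threshold:
            keep.append(i)
            last = frames[i]
    return keep
```

Raising the threshold trades temporal coverage for less redundancy; the provided optical flow could serve as a more motion-aware change measure than raw pixel differences.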


[1] Cordts, Marius, et al. “The cityscapes dataset for semantic urban scene understanding.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[2] Geiger, Andreas, et al. “Vision meets robotics: The KITTI dataset.” The International Journal of Robotics Research 32.11 (2013): 1231-1237.
[3] Wang, Panqu, et al. “Understanding convolution for semantic segmentation.” arXiv preprint arXiv:1702.08502 (2017).
[4] Ilg, Eddy, et al. “FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks.” IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
[5] M. Everingham, A. S. M. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes challenge: A retrospective,” IJCV, vol. 111, iss. 1, 2014.
[6] Xu, Huazhe, et al. “End-to-end learning of driving models from large-scale video datasets.” IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.
[7] Neuhold, Gerhard, et al. “The mapillary vistas dataset for semantic understanding of street scenes.” Proceedings of the International Conference on Computer Vision (ICCV), Venice, Italy. 2017.
[8] Brostow, Gabriel J., Julien Fauqueur, and Roberto Cipolla. “Semantic object classes in video: A high-definition ground truth database.” Pattern Recognition Letters 30.2 (2009): 88-97.
[9] Perazzi, Federico, et al. “A benchmark dataset and evaluation methodology for video object segmentation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.