The Stories of My Projects

Intern

Spatial Pyramid Pooling (fit arbitrary sizes of images): Usually CNN requires a fixed-size (e.g., 224x224) input image due to fully-connected layers. When applied to images of arbitrary sizes, current methods mostly crop or pad the images to a fixed size, but the cropped region may not contain the entire object, while the wrapped content may result in unwanted geometric distortion. Therefore, we add a spatial pyramid pooling (SPP) layer on top of the last convolutional layer. The SPP layer pools the features and generates fixed-length outputs to feed into fully-connected layer. http://www.cnblogs.com/marsggbo/p/8572846.html
Online Hard Example Mining: It has always been detection datasets contain an overwhelming number of easy examples and a small number of hard examples. Automatic selection of these hard examples can make training more effective and efficient.

Project1: Human Action Recognition Based on Spatial-Temporal Attention

Our goal is to identify human actions in short videos. The difference between video classification and image classification is that video contains temporal information while RGB image only contains spatial information. So we want our model to extract both spatial and temporal features of different actions. We first chop each video into frames. In our encoder model, we use pre-trained Vgg19 to extract spatial features of each frame, then we use LSTM as decoder to extract temporal features. After that we use a self-defined loss function. The loss function is the combination of cross-entropy loss and two penalty terms. The two penalty term is used to add restriction to spatial and temporal attention. Since we found that spatial attention is prone to ignoring some key parts in video frames, I guess that is due to it trapped to a local optimum, so we introduce a regularization term to avoid this problem.

Therefore, our model can pay attention to key part in each frame and focus on important frames in each video.

Project2: Speech Recognition by Attention-based Deep Neural Network

Project3: Movie Recommender System with MapReduce

User Collaborative Filtering: based on the similarity between users. Item Collaborative Filtering: based on the similarity between items.

e.g. |User| Movie 1| Movie 2| Movie 3| Movie 4| |—|—|—|—|—|

Why we choose Item CF:

The number of users weighs more than number of products.
Item will not change frequently, lowering calculation.
Using user’s historical data, more convincing.

When can we choose user CF: news (number of news weighs more than number of users, news will change frequently.)

How to define relationship between different movies? We can use user’s rating history, movie category, movie producer… Here we choose rating history, i.e., if one user rated two movies, these two are related. co-occurrence matrix: |User| M1| M2| M3| M4| |—|—|—|—|—| A|1|1| |1| | B|1|1|1| | | C| |1|1| |1| D| |1| |1| | —–> | |M1| M2| M3| M4| M5| |—|—|—|—|—|—| M1|2|2|1|1|0| M2|2|4|2|2|1| M3|1|2|2|0|1| M4|1|2|0|2|0| M5|0|1|1|0|1| Then we will do normalization to the co-occurrence matrix. rating matrix: |User| M1| M2| M3| M4| |—|—|—|—|—| A|9|4| |8| | B|3|7|8| | | C| |8|7| |4| D| |5| |8| | —–> |Movie|UserB rating| |—|—| M1|3| M2|7| M3|8| M4|0| M5|0| co-occurrence * rating matrix = result

cold start problem: each time a new movie is placed on the website, it goes through the cold start phase due to the lack of valuable user interactions. We can use contented-based filter.

The Stories of My Projects

Intern

Project1: Human Action Recognition Based on Spatial-Temporal Attention

Project2: Speech Recognition by Attention-based Deep Neural Network

Project3: Movie Recommender System with MapReduce

Project4: