Atharva Bhagwat

Separating Perception and Reasoning via Relation Networks

Visual Question Answering (VQA) is a multi-modal task that relates text and images through captions or questions. For example, given a picture of a busy highway, a question could be “How many red cars are there?” or “Are there more motorbikes than cars?”. The task is challenging because it requires a high-level understanding of the text, the image, and the relationships between them.

In this project, we study Relation Networks (RNs), an AI approach that combines neural and symbolic representations, and apply them to the VQA task.

Relation Network

A Relation Network (RN) is a neural network module built for relational reasoning. The design philosophy behind RNs is to constrain the functional form of a neural network so that it captures the core common properties of relational reasoning.

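The simplest form of the RN composite function is: $RN(O) = f_\phi\left(\sum_{i,j} g_\theta(o_i, o_j)\right)$, where $O = \{o_1, \dots, o_n\}$ is the set of input objects, $g_\theta$ computes a relation for each pair of objects, and $f_\phi$ maps the summed relations to the output. Both are functions with learnable parameters $\theta$ and $\phi$ (MLPs in the original RN formulation).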
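The composite function can be illustrated with a minimal sketch in plain Python. The names `relation_network`, `toy_g`, and `toy_f` are hypothetical, and the toy functions are hand-written stand-ins for the learned networks, not the project's actual model. The sum is taken over all ordered pairs, including pairs of an object with itself, which is one reading of the $i,j$ index.

```python
from itertools import product
from typing import Callable, List, Sequence

Vector = List[float]


def relation_network(
    objects: Sequence[Vector],
    g: Callable[[Vector, Vector], Vector],
    f: Callable[[Vector], Vector],
) -> Vector:
    """Compute f(sum over all ordered pairs (i, j) of g(o_i, o_j))."""
    total = None
    for o_i, o_j in product(objects, repeat=2):
        relation = g(o_i, o_j)
        if total is None:
            total = list(relation)
        else:
            # Element-wise accumulation of the pairwise relation vectors.
            total = [t + r for t, r in zip(total, relation)]
    if total is None:
        raise ValueError("at least one object is required")
    return f(total)


# Hypothetical stand-ins for the learned functions g_theta and f_phi.
def toy_g(o_i: Vector, o_j: Vector) -> Vector:
    # Element-wise absolute difference between the two objects.
    return [abs(a - b) for a, b in zip(o_i, o_j)]


def toy_f(v: Vector) -> Vector:
    # Collapse the summed relation vector to a single number.
    return [sum(v)]


objects = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
result = relation_network(objects, toy_g, toy_f)
```

Because the pairwise outputs are summed before `f` is applied, the result does not depend on the order in which the objects are listed.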

Sort-of-CLEVR Dataset

We use the Sort-of-CLEVR dataset to train and evaluate our RN.

The Sort-of-CLEVR dataset contains 10000 images of size 75 × 75 × 3, 200 of which are used as the validation set.

There are 20 questions for each image (10 non-relational and 10 relational).
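As a quick check of the dataset size, the arithmetic below counts the resulting image-question pairs. It assumes that all images outside the validation set are used for training, which the description above does not state explicitly.

```python
NUM_IMAGES = 10_000
NUM_VALIDATION_IMAGES = 200
QUESTIONS_PER_IMAGE = 20  # 10 non-relational + 10 relational

# Assumption: every image that is not in the validation set is a training image.
num_train_images = NUM_IMAGES - NUM_VALIDATION_IMAGES

train_pairs = num_train_images * QUESTIONS_PER_IMAGE
val_pairs = NUM_VALIDATION_IMAGES * QUESTIONS_PER_IMAGE

# Half of the questions for each image are relational.
train_relational_pairs = train_pairs // 2
val_relational_pairs = val_pairs // 2
```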

The non-relational questions are divided into 3 subtypes:

The relational questions are divided into 3 subtypes:

Questions are encoded as binary strings of length 11, where the first 6 bits identify the color of the object mentioned in the question, as a one-hot vector, and the last 5 bits identify the question type and the subtype.

[index_0-5 → one-hot vector of 6 colors,

index_6 → non_rel_ques,

index_7 → rel_ques,

index_8,9,10 → question_subtype]
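The layout above can be sketched as a small encoder. This is an illustration only: the function name `encode_question` is hypothetical, the color order is assumed to match the order listed for the answer vector (red, green, blue, orange, gray, yellow), and the subtype is assumed to be one-hot over indices 8 to 10.

```python
from typing import List

# Assumed color order, taken from the answer-vector description.
COLORS = ["red", "green", "blue", "orange", "gray", "yellow"]


def encode_question(color: str, relational: bool, subtype: int) -> List[int]:
    """Build the length-11 binary question vector."""
    if subtype not in (0, 1, 2):
        raise ValueError("subtype must be 0, 1, or 2")
    vec = [0] * 11
    vec[COLORS.index(color)] = 1       # indices 0-5: one-hot color
    vec[7 if relational else 6] = 1    # index 6: non-relational, index 7: relational
    vec[8 + subtype] = 1               # indices 8-10: subtype (assumed one-hot)
    return vec


question = encode_question("blue", relational=True, subtype=1)
```

Under these assumptions, every encoded question has exactly three bits set: one for the color, one for the question type, and one for the subtype.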

Answers are represented as a one-hot vector of size 10.

For non-relational questions the answers are of the form:

[index_0,1 → yes, no,

index_2,3 → is rectangle?, is circle?,

index_4-9 → one-hot vector of 6 colors (red, green, blue, orange, gray, yellow)]

For relational questions the answers are of the form:

[index_0,1 → n/a, n/a,

index_2,3 → is rectangle?, is circle?,

index_4-9 → count one-hot vector to denote the number of objects 0, 1, ..., 5]
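To make the two answer layouts concrete, here is a hedged sketch of a decoder that reads a one-hot answer vector back into a label. The function name `decode_answer` and the returned strings are hypothetical. The only difference between the two layouts it relies on is the meaning of indices 4 to 9 (a color for non-relational questions, a count from 0 to 5 for relational ones).

```python
from typing import List

COLORS = ["red", "green", "blue", "orange", "gray", "yellow"]


def decode_answer(one_hot: List[int], relational: bool) -> str:
    """Map a length-10 one-hot answer vector to a readable label."""
    if len(one_hot) != 10 or sum(one_hot) != 1:
        raise ValueError("expected a one-hot vector of size 10")
    idx = one_hot.index(1)
    if idx in (0, 1):
        if relational:
            # Indices 0 and 1 are not used for relational questions.
            raise ValueError("indices 0 and 1 are n/a for relational questions")
        return "yes" if idx == 0 else "no"
    if idx in (2, 3):
        return "rectangle" if idx == 2 else "circle"
    slot = idx - 4
    # Relational: slot is the object count 0..5; non-relational: slot picks a color.
    return str(slot) if relational else COLORS[slot]


answer = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
as_count = decode_answer(answer, relational=True)
as_color = decode_answer(answer, relational=False)
```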

RN Model Architecture

[Figure: RN model architecture]

Data flow through the network is as follows:

The network is trained with a cross-entropy loss and the Adam optimizer.
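For reference, the per-sample loss can be sketched in plain Python, assuming the standard softmax cross-entropy over the 10 answer classes. The function name `cross_entropy` is hypothetical, and the Adam update itself is not shown.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target_index: int) -> float:
    """Softmax cross-entropy for one sample: -log softmax(logits)[target_index]."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


# With uniform scores over 10 answer classes the loss equals log(10).
uniform_loss = cross_entropy([0.0] * 10, target_index=3)
```

A model that has not learned anything yet and scores all 10 answers equally therefore starts at a loss of about 2.30, which gives a baseline for reading the loss values reported in the results.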

Hyperparameters for the network:

Object pairs with questions, for 1 image

[Figure: object pairs with questions]

Questions:

Feature Map:

Object pairs with questions:

See the code for a detailed description of the shapes and data transformations.
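As a rough illustration of how object pairs are combined with a question for one image, the sketch below treats every cell of a feature map as an object, tags it with its grid coordinates, and concatenates each ordered pair with the question vector. The function name `build_pairs`, the coordinate tagging, and the toy sizes (a 2 × 2 map with 3 channels) are assumptions for illustration, not the shapes used in the project.

```python
from itertools import product
from typing import List

Vector = List[float]


def build_pairs(feature_map: List[List[Vector]], question: Vector) -> List[Vector]:
    """Return one vector per ordered object pair: [o_i, o_j, question]."""
    objects = []
    for row, cells in enumerate(feature_map):
        for col, cell in enumerate(cells):
            # Assumption: each object is tagged with its (row, col) position.
            objects.append(list(cell) + [float(row), float(col)])
    return [o_i + o_j + list(question) for o_i, o_j in product(objects, repeat=2)]


# Toy feature map: 2 x 2 cells, 3 channels each (hypothetical sizes).
feature_map = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[0.7, 0.8, 0.9], [1.0, 1.1, 1.2]],
]
question = [0.0] * 11  # length-11 question vector
pairs = build_pairs(feature_map, question)
```

With 4 objects there are 16 ordered pairs, and each pair vector has (3 + 2) + (3 + 2) + 11 = 21 entries. The number of pairs grows quadratically with the number of feature-map cells.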

VQA on Sort-of-CLEVR

Results

After training the model for 50 epochs, the best final metrics are as follows:

Relational Data

| Train Accuracy (%) | Train Loss | Test Accuracy (%) | Test Loss |
| --- | --- | --- | --- |
| 98.40 | 0.045 | 91.28 | 0.304 |

Non-Relational Data

| Train Accuracy (%) | Train Loss | Test Accuracy (%) | Test Loss |
| --- | --- | --- | --- |
| 99.98 | 0.001 | 99.97 | 0.001 |

Sample test outputs:

[Figure: sample test output, test_0]

[Figure: sample test output, test_45]


Team: Atharva Bhagwat, Harini Appansrinivasan, Abdulqadir Zakir

Here is the link to the repository.