Kaggle project. The task is to detect malware based on features extracted from API calls. More information is available on the Kaggle website.
- Python 3.7
- PyTorch 10.1
- requirements.txt
- GPU with at least 1 GB of memory available (recommended)
- Download the train and test data from Kaggle
│
│
├───test # Downloaded from Kaggle
│ ├───0.npy
│ ├───...
│ └───6050.npy
├───train # Downloaded from Kaggle
│ ├───0.npy
│ ├───...
│ └───18661.npy
├───train_kaggle.csv # Downloaded from Kaggle
│
│
├───train.py # Epoch training
├───test.py # Generates solution.csv, which can be submitted
├───model.py # Model
├───run.py # Starts training
└───dataset.py # Provides data in batches
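
Because the model expects a fixed-size N×C×L input, variable-length samples have to be padded or truncated before they are batched. The following is a minimal sketch of that step, not the contents of dataset.py. It assumes each `.npy` file holds an array of shape (T, 102) with one row per API call; the helper names `pad_or_truncate` and `make_batch` are hypothetical.

```python
import numpy as np

MAX_LEN = 1000      # L, the maximum sequence length
NUM_FEATURES = 102  # C, the feature size


def pad_or_truncate(sample, max_len=MAX_LEN):
    """Turn a (T, C) sample (assumed layout) into a (C, max_len) float32 array."""
    sample = np.asarray(sample, dtype=np.float32)
    out = np.zeros((sample.shape[1], max_len), dtype=np.float32)
    steps = min(sample.shape[0], max_len)
    # Copy the first `steps` time steps; the remainder stays zero-padded.
    out[:, :steps] = sample[:steps].T
    return out


def make_batch(samples, max_len=MAX_LEN):
    """Stack individual samples into one (N, C, L) batch."""
    return np.stack([pad_or_truncate(s, max_len) for s in samples])
```

Under the same assumption, a batch of 50 training samples could be built with `make_batch([np.load(f"train/{i}.npy") for i in range(50)])`.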
In this project, we are using the same model as described in the paper: Dynamic Malware Analysis with Feature Engineering and Feature Learning. The model structure is shown below:
- Input: N×C×L tensor, where N is the batch size, C is the feature size (102), and L is the maximum sequence length (1000).
  - batchSize: 50
- Batch Normalization: Speeds up convergence.
- Gated CNN: Extracts usable features from the raw input.
  - gated_cnn_outputs: 128
  - gated_cnn_stride1: 1
  - gated_cnn_stride2: 1
  - gated_cnn_kernel1: 2
  - gated_cnn_kernel2: 3
- BiLSTM: The input features have sequential patterns, so we use a bi-directional LSTM to capture both past and future context.
  - lstm_layers: 1
  - lstm_neurons: 100
- MaxPool1D: Extracts the most important features from the hidden states generated by BiLSTM.
- Dense: Reduces the dimension of feature space.
  - fc_outputs: 64
- Dropout: Reduces overfitting.
  - dropout: 0.5
- Sigmoid: Generates probabilities for binary classification.
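
The layer list above could be wired together roughly as follows. This is a minimal PyTorch sketch under stated assumptions, not the repository's model.py. The class names `GatedConv1d` and `MalwareNet` are hypothetical. The right-padding that keeps both convolution branches at the same length, the channel-wise concatenation of the two branches, the ReLU after the dense layer, and the final 1-unit linear layer before the sigmoid are all guesses, since the list does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedConv1d(nn.Module):
    """Gated convolution: conv(x) * sigmoid(gate(x))."""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, stride=stride)
        self.gate = nn.Conv1d(in_channels, out_channels, kernel_size, stride=stride)

    def forward(self, x):
        # Assumption: right-pad by kernel_size - 1 so that, with stride 1,
        # the output length equals the input length.
        x = F.pad(x, (0, self.kernel_size - 1))
        return self.conv(x) * torch.sigmoid(self.gate(x))


class MalwareNet(nn.Module):
    def __init__(self, features=102, gated_cnn_outputs=128,
                 kernel1=2, kernel2=3, stride1=1, stride2=1,
                 lstm_layers=1, lstm_neurons=100,
                 fc_outputs=64, dropout=0.5):
        super().__init__()
        self.bn = nn.BatchNorm1d(features)
        self.gcnn1 = GatedConv1d(features, gated_cnn_outputs, kernel1, stride1)
        self.gcnn2 = GatedConv1d(features, gated_cnn_outputs, kernel2, stride2)
        self.lstm = nn.LSTM(2 * gated_cnn_outputs, lstm_neurons,
                            num_layers=lstm_layers, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * lstm_neurons, fc_outputs)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(fc_outputs, 1)  # assumption: single logit

    def forward(self, x):
        # x: (N, C, L)
        x = self.bn(x)
        # Concatenate both gated branches along the channel axis: (N, 256, L)
        x = torch.cat([self.gcnn1(x), self.gcnn2(x)], dim=1)
        # The LSTM expects (N, L, features) when batch_first=True.
        h, _ = self.lstm(x.transpose(1, 2))   # (N, L, 200)
        h = h.max(dim=1)[0]                   # max over time: (N, 200)
        h = self.dropout(F.relu(self.fc(h)))  # (N, 64)
        return torch.sigmoid(self.out(h)).squeeze(1)  # (N,) probabilities
```

For example, `MalwareNet()(torch.randn(50, 102, 1000))` should return a tensor of 50 probabilities, one per sample in the batch.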
Exp logs
| Exp | Description |
|---|---|
| 1573179669 | seed:28 90% train, 10% validation, pc |
| 1573200428 | seed:29 95% train, 5% validation, pc |
| 1573204629 | seed:29 95% train, 5% validation, server |
| 1573983562 | pc, batch 50 |
| 1574035600 | server, batch 25 |
| 1574035703 | server, batch 100 |
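
The seeded train/validation splits in the table can be reproduced in spirit with a deterministic shuffle. This is only an illustrative sketch: the function name `split_indices` is hypothetical, the sample count of 18662 assumes the training files run contiguously from 0.npy to 18661.npy, and the project's actual splitting code may use a different random number generator.

```python
import random


def split_indices(num_samples, train_fraction, seed):
    """Shuffle sample indices with a fixed seed and split them in two."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    cut = int(num_samples * train_fraction)
    return indices[:cut], indices[cut:]


# Mirrors the first experiment: seed 28, 90% train, 10% validation.
train_ids, val_ids = split_indices(18662, 0.9, seed=28)
```

Using a local `random.Random(seed)` instance keeps the split reproducible regardless of other random calls made during training.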
python run.py # All hyperparameters can be set inside run.py