I am working on a project using the Berka dataset, and I want to build a neural network to predict the loan status for accounts. The dataset contains multiple tables, and I want to avoid flattening them into a single table. Instead, I aim to feed the NN structured data as follows:
Input: For each account, include:
- All transaction records associated with the account
- Account-level details
- Data from "disp" and "card" tables
- Order records
Output: Predict the loan status (A, B, C, D).
Here are the dimensions after processing:
Transactions: (682, 675, 24)
Account: (682, 8)
Disp and Card: (682, 3, 9)
Order: (682, 5, 6)
Labels (Y): (682, 4) (one-hot encoded into a binary representation of the four classes)
Here the last number is the number of features, and the middle number (for the three-dimensional arrays) is the number of entries per account, padded to a fixed length.
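For context, the padding works like this per table; a minimal sketch for the transactions tensor (the variable names and dummy data are illustrative, not my actual pipeline):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# One variable-length (n_i, 24) array per account; dummy data for illustration.
txn_lists = [np.random.rand(np.random.randint(1, 676), 24) for _ in range(682)]

transactions = pad_sequences(
    txn_lists,
    maxlen=675,        # length of the longest transaction history
    dtype="float32",
    padding="post",    # zeros appended after the real records
    value=0.0,         # must match mask_value in the Masking layer
)
print(transactions.shape)  # (682, 675, 24)
```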
The distribution of loan_status in Y is as follows:
C: 403,
A: 203,
D: 45,
B: 31
Classes D (debt) and B (default) are minority classes, but the model needs high recall for these.
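One option I am considering for this skew is weighting the loss by inverse class frequency; a minimal sketch with scikit-learn, assuming Y is the one-hot label matrix described above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Recover integer classes from the one-hot labels (columns ordered A, B, C, D)
y_int = np.argmax(Y, axis=1)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(y_int), y=y_int)
class_weight = dict(zip(np.unique(y_int), weights))
# Keras scales each sample's loss by its class weight:
# model.fit(..., class_weight=class_weight)
```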
Neural Network
I plan to use an LSTM for the variable-length tables (e.g., transactions). Below are my input layers:
from tensorflow.keras.layers import (Input, Masking, LSTM, Dense,
                                     BatchNormalization, Concatenate)

transaction_input = Input(shape=(None, num_transaction_features), name="transaction_input")
account_input = Input(shape=(num_account_features,), name="account_input")
disp_card_input = Input(shape=(None, num_disp_card_features), name="disp_card_input")
order_input = Input(shape=(None, num_order_features), name="order_input")
My current architecture
# Process transactions: mask the zero padding, then stack two LSTMs
x_trans = Masking(mask_value=0.0)(transaction_input)
x_trans = LSTM(64, return_sequences=True)(x_trans)   # returns the full sequence
x_trans = LSTM(64, return_sequences=False)(x_trans)  # returns only the final state
x_trans = BatchNormalization()(x_trans)
x_trans = Dense(32, activation="relu")(x_trans)
x_trans = Dense(16, activation="relu")(x_trans)
# Process account data with Dense layers
x_account = Dense(32, activation="relu")(account_input)
x_account = Dense(16, activation="relu")(x_account)
x_account = BatchNormalization()(x_account)
# Process disp/card rows: mask the padding, summarize with a small LSTM
x_disp_card = Masking(mask_value=0.0)(disp_card_input)
x_disp_card = LSTM(32, return_sequences=False)(x_disp_card)
x_disp_card = Dense(16, activation="relu")(x_disp_card)
# Process standing orders the same way
x_order = Masking(mask_value=0.0)(order_input)
x_order = LSTM(16, return_sequences=False)(x_order)
x_order = Dense(16, activation="relu")(x_order)
# Merge the four branches
combined = Concatenate()([x_trans, x_account, x_disp_card, x_order])
# Optionally, add another layer to see if it helps
combined = Dense(16, activation="relu")(combined)
combined = Dense(16, activation="relu")(combined)
# Output layer
output = Dense(4, activation="softmax")(combined) # classification output
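For completeness, this is roughly how I build and train the model from these branches (the training-array names and hyperparameters are illustrative; class_weight is the optional dict from the sketch above):

```python
from tensorflow.keras.models import Model

model = Model(
    inputs=[transaction_input, account_input, disp_card_input, order_input],
    outputs=output,
)
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",  # matches the one-hot (682, 4) labels
    metrics=["accuracy"],
)
model.fit(
    [transactions, account, disp_card, order],  # the four arrays described above
    Y,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    class_weight=class_weight,  # optional: inverse-frequency weights from above
)
```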
Currently the model performs poorly; here is its classification report:
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| A | 0.39 | 0.53 | 0.45 | 76 |
| B | 0.00 | 0.00 | 0.00 | 0 |
| C | 0.87 | 0.74 | 0.80 | 236 |
| D | 0.73 | 0.55 | 0.63 | 29 |
| Accuracy |  |  | 0.68 | 341 |
| Macro Avg | 0.50 | 0.45 | 0.47 | 341 |
| Weighted Avg | 0.75 | 0.68 | 0.71 | 341 |
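(The report is produced with something like the following on the held-out split; the test-array names are illustrative.)

```python
import numpy as np
from sklearn.metrics import classification_report

# X_test: list of the four held-out input arrays; Y_test: one-hot labels
y_true = np.argmax(Y_test, axis=1)
y_pred = np.argmax(model.predict(X_test), axis=1)

print(classification_report(
    y_true, y_pred,
    labels=np.arange(4), target_names=["A", "B", "C", "D"],
    zero_division=0,
))
```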
Questions
- Is my approach to feeding structured data into the NN correct?
- How should I configure the LSTM layers to handle variable-length inputs like transactions?
- What should I use instead of LSTM layers for one-to-many data with no sequential pattern (one account may have several owners/users, or many standing orders)? See the pooling sketch after this list for the kind of alternative I mean.
- How can I improve the model's accuracy, and in particular its recall for the minority classes (D and B)?
- What metrics should I be aiming for? As a bank, I don't want to authorise loans for people who are unlikely to pay them back.
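Regarding the non-sequential one-to-many question: the kind of order-invariant replacement I have in mind for, e.g., the disp/card branch is to embed each row and then take a mask-aware mean instead of running an LSTM. A sketch (not validated):

```python
from tensorflow.keras.layers import (Input, Masking, Dense,
                                     TimeDistributed, GlobalAveragePooling1D)

disp_card_input = Input(shape=(None, 9), name="disp_card_input")

x = Masking(mask_value=0.0)(disp_card_input)          # flag zero-padded rows
x = TimeDistributed(Dense(16, activation="relu"))(x)  # embed each row independently
x = GlobalAveragePooling1D()(x)                       # mask-aware mean over real rows
x = Dense(16, activation="relu")(x)
```

Since the mean ignores row order, this branch treats owners/cards as a set rather than a sequence; sum or max pooling would work the same way.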