"In this case, the dataset was split into demographics and educational attainment information."

"That's interesting. Can you explain more?"
Demographics are things like mother's education, father's education, family salary bins, ZIP code, etc. Educational data is stuff like attendance, grades, etc.
The labels (also known as targets, endogenous data, etc.; this profession is bad with too much terminology) are what you're predicting: here, grade bins (A, B, C, D, and F) and other targets.
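To make that concrete, here's a minimal sketch of what such a dataset might look like. The column names and values are hypothetical, not from the actual study:

```python
import pandas as pd

# Hypothetical dataset: demographic and educational features plus a label.
df = pd.DataFrame({
    # Demographic features
    "mother_education": [12, 16, 14, 18],   # years of schooling
    "father_education": [10, 16, 12, 20],
    "family_salary_bin": [1, 3, 2, 4],      # binned household income
    "zip_code": ["44101", "44102", "44103", "44104"],
    # Educational features
    "attendance_pct": [0.92, 0.98, 0.85, 0.99],
    "prior_gpa": [2.8, 3.6, 2.5, 3.9],
    # Label / target: the grade bin we want to predict
    "grade_bin": ["B", "A", "C", "A"],
})

X = df.drop(columns="grade_bin")   # features (the input data)
y = df["grade_bin"]                # labels (targets, endogenous data)
```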
Developing these models is usually pretty basic. You start with a correlation table, find the most promising data, and try it out. Then you usually write a Python function (or use a Python package like Keras, but it's easier to just write a function) to run through every data combination to find the best columns in the dataset (called features) and the best model tuning settings (called hyperparameters; did I mention that the terminology is stupid in this profession?). Then you test it against a test set using fancy statistics, pick a few promising models, and see how they react to a bigger test set. If they hold up, deploy. I am oversimplifying, but this is the basic workflow.
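Here's a rough sketch of that workflow in Python, using scikit-learn rather than a hand-rolled search function; the model choice, parameter grid, and function name are illustrative assumptions, not the poster's actual code:

```python
from itertools import combinations

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

def search(df: pd.DataFrame, target: str = "grade_bin"):
    """Brute-force feature selection plus hyperparameter tuning (a sketch)."""
    X_all = df.drop(columns=target).select_dtypes("number")
    y = df[target]

    # Step 1: correlation table to spot promising features
    # (correlate numeric features with an ordinal encoding of the label).
    y_codes = y.astype("category").cat.codes
    print(X_all.corrwith(y_codes).sort_values())

    X_train, X_test, y_train, y_test = train_test_split(
        X_all, y, test_size=0.2, random_state=0
    )

    best = (0.0, None, None)
    # Step 2: run through every combination of features...
    for k in range(1, len(X_all.columns) + 1):
        for cols in combinations(X_all.columns, k):
            # Step 3: ...and tune hyperparameters for each combination.
            grid = GridSearchCV(
                LogisticRegression(max_iter=1000),
                {"C": [0.01, 0.1, 1.0, 10.0]},  # hyperparameter grid
                cv=3,
            )
            grid.fit(X_train[list(cols)], y_train)
            # Step 4: score the tuned model against the held-out test set.
            score = accuracy_score(y_test, grid.predict(X_test[list(cols)]))
            if score > best[0]:
                best = (score, cols, grid.best_params_)

    return best  # (test accuracy, winning feature set, winning hyperparameters)
```

One caveat on the design: truly trying every feature combination blows up exponentially with the number of columns, so in practice people often fall back on stepwise selection, regularization, or a library's built-in search instead of the exhaustive loop sketched here.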