How can I train a model on both text and numeric data?
Posted by boringblobking@reddit | LocalLLaMA | 3 comments
Is there a standard way of doing this? For example, suppose you have patient data taken from GP records. The data consists of discrete values like age, gender, and whether the patient smokes, but it also contains text describing their conditions, etc. How would you train a single model on all of this data to make inferences?
One idea I had was to build a vanilla neural network that takes the discrete data as inputs, and for the text, use BERT to encode it and feed the encodings into the same network as additional inputs. Is this likely to work? Is there a more standard way of dealing with such situations?
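For illustration only (not from the thread), a minimal sketch of the idea above, assuming PyTorch and HuggingFace transformers: a BERT [CLS] embedding is concatenated with the discrete features and passed through a small classifier head. The feature names, layer sizes, and example record are invented.

```python
# Minimal sketch: fuse a BERT text embedding with discrete/tabular features.
# All dimensions and the example record are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class TabularTextModel(nn.Module):
    def __init__(self, num_tabular_features, num_classes, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)   # text encoder
        hidden = self.bert.config.hidden_size              # 768 for bert-base
        # Small MLP head that mixes the [CLS] embedding with the discrete features
        self.head = nn.Sequential(
            nn.Linear(hidden + num_tabular_features, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, input_ids, attention_mask, tabular):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]        # [CLS] token embedding
        x = torch.cat([cls, tabular], dim=-1)    # fuse text and numeric features
        return self.head(x)

# Hypothetical usage with one record: age, smoker flag, plus a free-text note.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(["Patient reports persistent cough."], return_tensors="pt",
                padding=True, truncation=True)
tabular = torch.tensor([[54.0, 1.0]])            # e.g. [age, smokes]
model = TabularTextModel(num_tabular_features=2, num_classes=2)
logits = model(enc["input_ids"], enc["attention_mask"], tabular)
```

Whether to fine-tune BERT end to end or freeze it and train only the head is a design choice that depends on how much labelled data you have.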
Armistice_11@reddit
This is a very vague way of asking for suggestions. Please be clearer.
From what I read, you are trying to create a single model that can take different modalities of data, ranging from clinical text notes to clinical biomarkers, with occasional symptoms and events recorded as a time series. If you want to deal with multimodal data, first take a step back and look into multi-task learning, modal encoders, and transfer learning. Remember that data harmonization is very important.
I remember this paper from when it was published in Nature. Take a look at it - it's right up your alley.
https://www.nature.com/articles/s41467-023-37477-x
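To make the terms above concrete (this toy sketch is not from the comment or the linked paper), here is a PyTorch example of modal encoders feeding a shared trunk with two task heads trained jointly, i.e. multi-task learning. All dimensions, task names, and the random example data are assumptions for illustration; the text is assumed to be pre-encoded (e.g. by BERT) into a fixed-size vector.

```python
# Toy multi-task setup: one encoder per modality, a shared trunk, two heads.
# Dimensions, task names, and data are made up for illustration.
import torch
import torch.nn as nn

class MultiTaskClinicalModel(nn.Module):
    def __init__(self, text_dim=768, tabular_dim=8, shared_dim=128):
        super().__init__()
        # Modal encoders: one per input modality
        self.text_enc = nn.Sequential(nn.Linear(text_dim, shared_dim), nn.ReLU())
        self.tab_enc = nn.Sequential(nn.Linear(tabular_dim, shared_dim), nn.ReLU())
        # Shared trunk over the fused representation
        self.shared = nn.Sequential(nn.Linear(2 * shared_dim, shared_dim), nn.ReLU())
        # Two task-specific heads sharing the same representation
        self.diagnosis_head = nn.Linear(shared_dim, 5)   # e.g. 5 diagnosis classes
        self.risk_head = nn.Linear(shared_dim, 1)        # e.g. a scalar risk score

    def forward(self, text_emb, tabular):
        z = self.shared(torch.cat([self.text_enc(text_emb),
                                   self.tab_enc(tabular)], dim=-1))
        return self.diagnosis_head(z), self.risk_head(z)

# Training combines the losses of both tasks on the shared representation.
model = MultiTaskClinicalModel()
text_emb = torch.randn(4, 768)     # pretend BERT embeddings of clinical notes
tabular = torch.randn(4, 8)        # pretend harmonized numeric features
diag_logits, risk = model(text_emb, tabular)
loss = nn.CrossEntropyLoss()(diag_logits, torch.randint(0, 5, (4,))) \
     + nn.MSELoss()(risk.squeeze(-1), torch.randn(4))
```

Transfer learning would correspond to initializing the text encoder from a pretrained model rather than training it from scratch.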
boringblobking@reddit (OP)
This article looks very interesting, and I will have to look into multi-task learning, modal encoders, and transfer learning. Thank you.
Armistice_11@reddit
Let us know how you proceed on this.