Feasibility study towards AI-based identification of adolescent mental health discourse on social media
Senior Data Scientist, Informatics

Overview
The overall goal of the project was to extend a previous feasibility study towards AI-based identification of mental health discourse among adolescents on social media. Previous work, carried out in collaboration with UNICEF, established the feasibility of the topical component, that is, detecting social media posts relating to mental health. However, the age component presented challenges, since it is far more difficult to attribute age accurately based solely on the text an author produces. This difficulty is especially pronounced when the topic is fixed, as topic is one of the key context cues in author profiling.
The specific gap that the EMH Seed fund project aimed to address was two-fold:
Lack of relevant social media data
Feasibility assessment of detection of discourse
The hope was that, should the results be promising, the project would provide a proof-of-concept for a tool offering a timely, population-level lens on youth mental health. Such a tool, when combined with traditional social research methods, could aid prioritisation efforts, e.g. at UNICEF. Further, it could provide a basis for novel research at the intersection of data mining, social science and psychology, e.g. examining temporality, network effects, geography, etc.
Outcomes
The key outputs from the project are:
An age-annotated dataset of social media posts, based on a snapshot of data from the selected platform, Reddit, together with the computational pipeline for automated age annotation (specific to the platform).
An assessment of the performance of two popular lightweight Large Language Models (LLMs), GPT-3.5 and Mistral Instruct v0.2, under a few hand-crafted prompts (the typical and most straightforward way of interacting with LLMs).
Summary of work and results in PowerPoint:
Technical report for internal use and for developers, e.g. for follow-up work
(Version for the broader audience in discussions with UNICEF)
The age-annotated dataset and pipeline are a minor contribution, intended as a foundation for any further development of the overarching approach; they may also be of interest to computational NLP communities. The main result is the evaluation of the feasibility of LLM-based identification of adolescent mental health discourse. In terms of reported metrics, the performance of the LLMs is mixed, exhibiting different trade-offs under different prompting configurations. However, considering that any developed tool would not be the sole decision aid, these results show promise for a big-data approach to monitoring social media for adolescent mental health discourse.
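To make the notion of a trade-off between prompting configurations concrete, the following is a minimal sketch of how precision, recall and F1 could be computed for a binary "adolescent mental health post" label. It is an illustration only: the function name, the two prompt styles ("strict" and "permissive") and the toy labels are hypothetical and are not taken from the project's data or results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Metrics:
    precision: float
    recall: float
    f1: float


def binary_metrics(gold: Sequence[bool], predicted: Sequence[bool]) -> Metrics:
    """Compute precision, recall and F1 for the positive class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Metrics(precision, recall, f1)


# Toy, hand-made labels for illustration only (NOT project results).
gold = [True, True, True, False, False, False]
# A hypothetical "strict" prompt flags few posts: few false alarms, many misses.
strict = [True, False, False, False, False, False]
# A hypothetical "permissive" prompt flags many posts: few misses, more false alarms.
permissive = [True, True, True, True, True, False]

for name, preds in [("strict", strict), ("permissive", permissive)]:
    m = binary_metrics(gold, preds)
    print(f"{name}: precision={m.precision:.2f} recall={m.recall:.2f} f1={m.f1:.2f}")
```

In this toy setting the strict configuration favours precision and the permissive one favours recall. Which side of that trade-off is preferable depends on how the tool's output would be combined with other sources of evidence.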
Future Directions
The planned next step is a presentation to UNICEF and, possibly, further collaboration to move beyond feasibility assessment and proof-of-concept towards operationalisation. It is important to note that operationalisation may be difficult in the near term, following recent changes to several social media platforms.
In parallel with the project, the main researcher also supervised an MSc student working on another direction that extends the original work. Together with the results obtained here, that work may lead to a publication.
Finally, the overall positive assessment of feasibility opens the door to further downstream research. Several directions are possible, e.g. temporality or network effects, as already mentioned. Of particular interest to the research group are novel and advanced data mining approaches towards building a causal understanding of the drivers and progression of mental health issues from unstructured data. We are particularly interested in the Wellcome Mental Health Award funding call, "Transforming early intervention for anxiety, depression and psychosis in young people", but applying would require a larger team with complementary expertise.