This repository contains the datasets used in the survey analysis and the Reddit post analysis.
The survey/ folder contains several files related to the survey portion of the paper:
- survey_questions.pdf contains a complete list of survey questions.
- survey_dataset.csv contains the full dataset of responses.
- survey_analysis.ipynb contains the scripts used for all of the statistical analyses.
- annotated_job_titles.csv contains the manual annotation of job titles provided from the survey.
The post_analysis/ folder contains several folders related to the Reddit post analysis:
- /final_data/ contains our final post results from each step. Our fully labeled dataset of 666 posts is contained in the file titled final_posts_666.json, whereas raw_dataset_2986.json contains the full set of extracted posts, and privacy_dev_dataset_2248.json contains all posts checked for their relation to privacy and development, both manually and via GPT-5. The two folders, /GDPR_challenges/ and /privacy_dev/, contain the ground truth files, the final predicted LLM files, and the LLM predictions after manual corrections.
- /prompts/ contains our prompts for each step (zeroshot, fewshot, and Chain-of-Thought), with the CoT prompt in each subfolder being our final prompt used.
- /results/ contains the raw results of each automated labeling step for each model. The training data for each step was tested on multiple prompts (i.e., zero-shot, few-shot, and Chain-of-Thought) and contains the respective results for each of the 8 models, while the test data contains the results for the final CoT prompt used.
- /samples/ contains the individual sample files used to train and test the models in each stage.
- scripts/ contains all relevant scripts, including those used to run each model as well as those used for statistics.
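
As a quick sanity check, the datasets described above can be loaded with the Python standard library. The snippet below is only a minimal sketch: the helper names load_json_records and load_csv_rows are hypothetical, the relative paths are inferred from the folder descriptions, and the JSON files are assumed to hold either a top-level list of posts or a dictionary keyed by post ID. Adjust it if the actual layout differs.

```python
import csv
import json
from pathlib import Path


def load_json_records(path):
    """Load a JSON file and return its records as a list.

    Assumption: the file holds either a top-level list of posts or a
    dict keyed by post ID; the real layout may differ.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        return list(data.values())
    return list(data)


def load_csv_rows(path):
    """Load a CSV file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    # Paths are assumed from the folder descriptions in this README.
    posts_path = Path("post_analysis/final_data/final_posts_666.json")
    survey_path = Path("survey/survey_dataset.csv")
    if posts_path.exists():
        print(len(load_json_records(posts_path)), "labeled posts")
    if survey_path.exists():
        print(len(load_csv_rows(survey_path)), "survey responses")
```

If the file names reflect their contents, the first count should match the 666 labeled posts mentioned above; a mismatch would suggest that the assumed JSON layout is wrong.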