diff --git a/CITATION.cff b/CITATION.cff
new file mode 100644
index 0000000..d5fc688
--- /dev/null
+++ b/CITATION.cff
@@ -0,0 +1,25 @@
+cff-version: 1.2.0
+message: "If you use this software, please cite it as below."
+type: software
+title: "CodebookAI: LLM-powered deductive coding for qualitative research"
+authors:
+  - family-names: Maier
+    given-names: Torsten
+    orcid: "https://orcid.org/0000-0000-0000-0000"
+    affiliation: "Kettering University"
+repository-code: "https://github.com/tmaier-kettering/CodebookAI"
+license: MIT
+abstract: >
+  CodebookAI is a desktop application that assists qualitative researchers in
+  applying deductive coding at scale using large language models. It provides
+  a graphical interface for single-label and multi-label text classification
+  and keyword extraction via OpenAI's API, plus inter-rater reliability analysis.
+keywords:
+  - qualitative-research
+  - deductive-coding
+  - text-classification
+  - large-language-models
+  - openai
+  - inter-rater-reliability
+  - python
+  - tkinter
diff --git a/README.md b/README.md
index 3caf2ae..c165eb4 100644
--- a/README.md
+++ b/README.md
@@ -53,7 +53,7 @@ If you find this tool helpful, consider supporting its development by [buying me
 
 ## License
 
-This project is provided as-is for educational and research purposes. Please ensure compliance with OpenAI's usage policies when using this application.
+This project is licensed under the [MIT License](LICENSE). Please ensure compliance with OpenAI's usage policies when using this application.
 
 ## Disclaimer 
 This application requires an OpenAI API key and will incur costs based on your usage. Batch processing typically offers significant cost savings compared to individual API calls. Please monitor your usage and costs through the OpenAI dashboard.
diff --git a/paper.bib b/paper.bib
new file mode 100644
index 0000000..3a0e6fa
--- /dev/null
+++ b/paper.bib
@@ -0,0 +1,37 @@
+@book{saldana2021,
+  author    = {Saldaña, Johnny},
+  title     = {The Coding Manual for Qualitative Researchers},
+  year      = {2021},
+  edition   = {4th},
+  publisher = {SAGE Publications},
+  address   = {London}
+}
+
+@article{cohen1960,
+  author  = {Cohen, Jacob},
+  title   = {A coefficient of agreement for nominal scales},
+  journal = {Educational and Psychological Measurement},
+  year    = {1960},
+  volume  = {20},
+  number  = {1},
+  pages   = {37--46},
+  doi     = {10.1177/001316446002000104}
+}
+
+@article{mayring2000,
+  author  = {Mayring, Philipp},
+  title   = {Qualitative Content Analysis},
+  journal = {Forum: Qualitative Social Research},
+  year    = {2000},
+  volume  = {1},
+  number  = {2},
+  url     = {https://www.qualitative-research.net/index.php/fqs/article/view/1089}
+}
+
+@techreport{openai2024,
+  author      = {{OpenAI}},
+  title       = {{GPT-4} Technical Report},
+  year        = {2023},
+  institution = {OpenAI},
+  url         = {https://arxiv.org/abs/2303.08774}
+}
diff --git a/paper.md b/paper.md
index 293517e..0f18838 100644
--- a/paper.md
+++ b/paper.md
@@ -4,56 +4,66 @@ tags:
   - python
   - qualitative-research
   - text-classification
-  - llm
+  - large-language-models
   - openai
   - deductive-coding
+  - inter-rater-reliability
 authors:
   - name: "Torsten Maier"
-    orcid: "YOUR-ORCID-HERE"
-    affiliation: "Kettering University"
-date: "2026-04-30"
+    orcid: "0000-0000-0000-0000"
+    affiliation: 1
+affiliations:
+  - name: Kettering University, United States
+    index: 1
+date: 30 April 2026
+bibliography: paper.bib
 ---
 
-## Summary {#summary}
+## Summary
 
-CodebookAI is a desktop application that automates deductive coding, also known as text classification, for qualitative research workflows using OpenAI's API. Researchers upload text segments (such as interview excerpts or survey responses) along with a deductive codebook defining categories, and the tool classifies each segment by sending context-aware prompts to large language models and parsing the responses into structured output.
+CodebookAI is a desktop application for qualitative researchers that automates deductive coding—the systematic assignment of predefined category labels to text segments—using large language models (LLMs). Researchers in social sciences, education, health sciences, and related disciplines routinely analyze interview transcripts, open-ended survey responses, and similar text by applying codes from a structured codebook [@saldana2021]. This process is labor-intensive when datasets contain hundreds or thousands of segments.
 
-Unlike manual coding or rigid keyword-matching tools, CodebookAI leverages LLMs' understanding of context, nuance, and natural language to apply deductive codes consistently at scale. It supports researchers in social sciences, education, and health sciences who need to process large qualitative datasets efficiently. The tool handles tens of thousands of text segments via OpenAI's batch API, delivering results within 24 hours, and exports coded data as spreadsheets ready for analysis in tools like R or SPSS.
+CodebookAI provides a Tkinter-based graphical interface through which researchers upload a codebook (a flat list of category names) and text segments from CSV, TSV, Excel, or Parquet files. The application constructs structured prompts, enforces codebook fidelity through JSON-schema-constrained outputs validated by Pydantic, and returns coded results as spreadsheets ready for analysis in tools such as R, SPSS, or Excel.
 
-The application provides a simple GUI for codebook definition, text import, job submission, and result review, with robust error handling for API failures and malformed responses. CodebookAI lowers the barrier to LLM-assisted qualitative analysis while maintaining researcher control over codes and outputs.
+The application supports two processing modes—**batch** and **live**—and four primary workflows:
 
-## Statement of need {#statement-of-need}
+1. **Single-label classification**: assign exactly one codebook category to each text segment.
+2. **Multi-label classification**: assign one or more categories to each segment.
+3. **Keyword extraction**: extract free-form keyword lists from text without a predefined codebook.
+4. **Inter-rater reliability analysis**: compute percent agreement and Cohen's kappa [@cohen1960] between two independently coded datasets, with export to Excel.
 
-Manual deductive coding remains labor-intensive for qualitative researchers, often requiring weeks to classify hundreds of text segments. Existing tools either rely on brittle keyword matching or require custom machine learning pipelines that demand data science expertise. Large language models offer a promising alternative through zero-shot classification, but researchers lack accessible, desktop-based interfaces that integrate codebook management, batch processing, and reliable output parsing.
+A **correlogram** tool visualizes pairwise co-occurrence between two sets of codes as a heatmap, and a **data preparation** module draws random samples of a user-specified size to create coding subsets for pilot studies.
 
-CodebookAI fills this gap by providing a turnkey solution for LLM-powered deductive coding. It targets qualitative researchers who are domain experts but not programmers, enabling them to leverage OpenAI's models without writing prompts or API calls. The tool's batch support addresses scale limitations of interactive APIs, making it practical for real research projects with thousands of segments.
+## Statement of Need
 
-## Design choices {#design-choices}
+Deductive qualitative coding is a methodological cornerstone of qualitative content analysis [@mayring2000] and systematic qualitative inquiry more broadly. The process requires trained coders to evaluate each text segment against a codebook, and typically involves two or more coders to establish inter-rater reliability. For datasets of several hundred to several thousand items, this represents a significant investment of researcher time—often weeks per study.
 
-CodebookAI is implemented in Python using [list GUI framework if applicable, e.g., Tkinter/PyQt/CustomTkinter] for the desktop interface and the OpenAI Python client for API interactions. The architecture separates concerns into three layers: a GUI layer for user input/output, a service layer for prompt construction and response parsing, and a thin API client that can be mocked for testing.
+Existing computational tools address this challenge inadequately. Keyword-matching approaches fail on semantically complex text. Supervised machine-learning pipelines require labeled training data, data-science expertise, and large datasets before they achieve acceptable accuracy. Commercial qualitative data analysis software (e.g., NVivo, ATLAS.ti, MAXQDA) is built primarily around manual coding workflows rather than codebook-constrained LLM classification. General-purpose LLM chat interfaces such as ChatGPT do not provide the structured outputs, batch processing, or codebook integration that systematic research requires.
 
-Key design decisions include:
-- **Deterministic prompt templates** that incorporate the full codebook and segment context, minimizing LLM hallucination.
-- **Structured response parsing** that validates and maps LLM outputs to codes, with fallbacks for ambiguous or malformed responses.
-- **Batch API integration** for cost-effective, high-volume processing, with job tracking and retry logic.
-- **Offline-first design** where possible, with API calls only for classification.
+LLMs demonstrate strong zero-shot text classification capabilities [@openai2024]: they can assign labels to previously unseen text without task-specific training. CodebookAI operationalizes this capability in a researcher-facing tool that requires no programming knowledge. It enforces codebook fidelity through a JSON schema derived at runtime from the user-defined codebook, so the model is structurally prevented from producing labels outside the allowed set. The **batch processing** mode (via OpenAI's Batch API) reduces costs by up to 50% compared to synchronous calls and scales to tens of thousands of segments per job, with results typically available within 24 hours. The integrated reliability module closes the qualitative workflow loop by enabling immediate quantitative comparison of LLM-generated codes with human-generated codes.
 
-The application avoids external dependencies beyond OpenAI and standard Python libraries to ensure reproducibility across researcher environments.
+## Functionality
 
-## Functionality {#functionality}
+CodebookAI is distributed as a single-file Windows executable that requires no separately installed dependencies; users running from source need only a standard `pip install`. After entering an OpenAI API key in **File → Settings**, the workflow proceeds as follows:
 
-Users define a codebook by specifying category names and descriptions, then import text segments from CSV or TXT files. CodebookAI constructs prompts combining the codebook, segment text, and instructions for single-label classification, submits via OpenAI's batch API, and retrieves results after processing (typically 24 hours).
+**Classification (batch or live).** A file import wizard accepts CSV, TSV, Excel, and Parquet files. The user selects the column containing text segments; a second import selects the column containing codebook labels. CodebookAI builds a Pydantic model at runtime that treats the label list as a strict enumeration, derives a JSON schema from it, and passes the schema to the OpenAI API as a structured-output constraint. In batch mode, all requests are bundled into a single JSONL file and submitted to OpenAI's Batch API; the main window displays job status and allows one-click result retrieval. In live mode, segments are classified sequentially with a progress bar. Results are exported as CSV.
 
-Results are displayed in a spreadsheet-like view with original text, predicted code, confidence (if available), and LLM rationale. Users can review, edit, override predictions, and export to CSV for further analysis. Error logs and partial results ensure no data loss from API issues.
+**Multi-label classification** follows the same workflow but uses an array-typed JSON schema, requiring at least one label per segment.
 
-## AI usage disclosure {#ai-usage-disclosure}
+**Keyword extraction** submits each text segment with a prompt requesting a list of key terms, returning results in the same batch or live modes.
 
-GitHub Copilot was used extensively during software development, generating nearly all front-end code and portions of the back-end implementation. The core back-end classification logic and overall architecture were authored by the human developer. All Copilot-generated code was reviewed, validated, debugged, and integrated by the human author, who takes full responsibility for the software's correctness, licensing, and performance.
+**Reliability analysis.** The inter-rater reliability module accepts two coded datasets and joins them on a shared text column. It computes the number of matched rows, percent agreement, and Cohen's kappa over the union of label sets, then exports a two-sheet Excel workbook with summary statistics and per-row agreement flags.
 
-Perplexity AI (powered by Grok 4.1) was used to draft sections of this paper and assist with JOSS submission requirements. GitHub Copilot was also used to generate automated tests, CI workflows, and the `CONTRIBUTING.md` file. All AI-generated content was reviewed, substantially revised for accuracy, and validated by the author.
+**Correlogram.** Given two coded datasets, this tool computes a cross-tabulation matrix between the label columns and renders it as a customizable heatmap (with options for row, column, or global normalization, colormap selection, and cell annotation).
 
-## Acknowledgments {#acknowledgments}
-(optional — add if relevant)
+**Data preparation.** A random sampler draws a user-specified number of rows from any imported dataset, enabling researchers to create pilot subsets before committing to full-dataset API costs.
 
-## References {#references}
-(optional — cite key dependencies or prior work)
+## AI Usage Disclosure
+
+GitHub Copilot was used extensively during software development, generating the majority of front-end code and portions of the back-end implementation. The core classification logic, structured-output schema generation, and overall architecture were authored by the human developer. All Copilot-generated code was reviewed, validated, debugged, and integrated by the human author, who takes full responsibility for the software's correctness, licensing, and performance. GitHub Copilot was also used to generate automated tests, CI workflows, and the `CONTRIBUTING.md` file. All AI-generated content was reviewed and substantially revised for accuracy by the author.
+
+## Acknowledgments
+
+The author thanks the students and research collaborators at Kettering University whose qualitative research needs motivated the development of this tool.
+
+## References
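
The reliability analysis described in `paper.md` (percent agreement and Cohen's kappa computed over the union of the two label sets) can be sketched as follows. This is a minimal illustration under stated assumptions, not CodebookAI's actual implementation: the function name `agreement_stats` and the sample labels are hypothetical, and the two label lists are assumed to be already joined row by row on the shared text column.

```python
from collections import Counter


def agreement_stats(labels_a, labels_b):
    """Return (percent_agreement, cohen_kappa) for two aligned label lists.

    Hypothetical helper for illustration; not taken from the CodebookAI source.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of rows where both coders chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement from each coder's marginal label frequencies,
    # summed over the union of labels used by either coder.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)

    if p_e == 1.0:
        # Both coders used one identical label throughout: kappa is undefined (0/0).
        return p_o, float("nan")

    kappa = (p_o - p_e) / (1.0 - p_e)
    return p_o, kappa


# Example: human codes versus LLM codes for four segments (made-up data).
human = ["A", "A", "B", "B"]
llm = ["A", "B", "B", "B"]
percent_agreement, kappa = agreement_stats(human, llm)
print(f"agreement={percent_agreement:.2f}, kappa={kappa:.2f}")
```

Summing over the union of labels means a category used by only one coder contributes zero to the expected agreement, so kappa stays well defined when the two datasets do not share an identical label set.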