Dikoka: AI-Powered Document Analyzer for Historical Records

Published:

Dikoka

Dikoka is an AI-powered document analyzer designed to extract key information, generate concise summaries, and suggest follow-up questions to deepen understanding. The project focuses on France’s involvement in Cameroon during the suppression of independence and opposition movements (1945-1971), based on the findings of the Franco-Cameroonian Commission.

Features

  • Document Loading and Processing: Load and process documents from a specified directory.
  • Vector Store Management: Initialize, update, and retrieve documents using a vector store.
  • Hierarchical Summarization: Generate hierarchical summaries of documents.

    This approach allows for organizing content into structured sections and paragraphs, facilitating comprehensive understanding of lengthy documents.

  • Question Dataset: A set of pre-established questions that serves as an example to guide content exploration.
  • Retrieval-Augmented Generation (RAG): Retrieve and generate answers to questions based on document content.
  • Multi Query Splitting: Inspired by LangChain Chat, this feature segments complex questions into sub-parts to improve response accuracy.
  • Chat Interface: An interactive interface for querying the AI assistant.

Approach Rationale

To effectively process voluminous and complex documents, Dikoka adopts several improvements over a naive RAG approach:

  • Hierarchical Summarization: By organizing content into clearly defined sections, this method helps quickly identify essential themes and information, improving readability and overall document comprehension.
  • Question Dataset: Offering a set of sample questions allows users to discover aspects they might not have considered, facilitating learning of new information and in-depth content exploration.
  • Multi Query Splitting: By dividing questions into smaller groups, this technique enables more precise processing of each segment, optimizing the relevance of generated responses.

These combined elements make Dikoka particularly effective for analyzing complex historical documents while offering an enhanced user experience.

Installation

To install the required dependencies, run:

pip install -r requirements.txt

Usage

Running the Hierarchical Summarizer

To launch the hierarchical summarizer, use the following command:

python src/summary/summarizer.py --folder_path <folder_path> --output_folder <output_folder_path>

Running the RAG System

To initialize and launch the RAG system, use the following command:

python app.py

Environment Variables

Configure the following environment variables to set models and other settings:

  • HF_MODEL: Hugging Face model name.
  • USE_OLLAMA_CHAT: Set to 1 to use ChatOllama.
  • OLLAMA_MODEL: Ollama model name.
  • GROQ_MODEL_NAME: Groq model name.
  • USE_HF_EMBEDDING: Set to 1 to use Hugging Face embeddings.
  • OLLAM_EMB: Ollama embedding model name.
  • OLLAMA_HOST: Ollama host URL.
  • OLLAMA_TOKEN: Ollama API token.
  • HUGGINGFACEHUB_API_TOKEN: Hugging Face Hub API token.
  • MAX_MESSAGES: Maximum number of messages to keep in chat history.
  • N_CONTEXT: Number of documents to retrieve in context.

Project Structure

  • src/vector_store/vector_store.py: Manages initialization, update, and retrieval of the vector store.
  • src/utilities/llm_models.py: Provides functions to obtain language models and embeddings.
  • src/utilities/embedding.py: Defines the custom embedding class.
  • src/summary/summarizer.py: Implements the hierarchical summarizer.
  • src/rag_pipeline/rag_system.py: Implements the RAG system for document retrieval and question answering.
  • src/rag_pipeline/prompts.py: Defines prompts for the RAG system.
  • app.py: Main script to launch the chat interface.

Example

To test Dikoka, follow these steps:

  1. Clone the repository:
    git clone git@github.com:Nganga-AI/medivocate.git
    
  2. Install dependencies:
    pip install -r requirements.txt
    
  3. Run the hierarchical summarizer:
    python src/summary/summarizer.py --folder_path data/297054_Volume_2 --output_folder data/summaries
    
  4. Launch the chat interface:
    python app.py
    

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License.

GitHub Repository