We present DocMaster, a hierarchical structure-aware document analysis system. DocMaster parses documents into hierarchical document trees preserving original layouts and constructs a structure-aware semantic index that enables accurate document filtering and in-depth analysis.
DocMaster combines structural and semantic analysis to deliver efficient, accurate document analysis.
Builds a hierarchical tree from structural elements. The LLM traverses the tree top-down, pruning irrelevant branches to minimise token usage.
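The top-down traversal with pruning can be sketched as follows. This is a minimal illustration, not DocMaster's implementation: the `Node` schema, the `summary` field, and the `judge` function (a keyword check standing in for the LLM relevance call) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One structural element in a document tree (hypothetical schema)."""
    title: str
    summary: str  # assumed to describe the node and everything below it
    children: List["Node"] = field(default_factory=list)


def traverse(node: Node, is_relevant: Callable[[Node], bool]) -> List[Node]:
    """Return relevant leaves, skipping any subtree whose root is judged irrelevant."""
    if not is_relevant(node):
        return []  # prune: nothing below this node is inspected
    if not node.children:
        return [node]
    hits: List[Node] = []
    for child in node.children:
        hits.extend(traverse(child, is_relevant))
    return hits


visited: List[str] = []  # records every node the judge was asked about


def judge(node: Node) -> bool:
    """Stand-in for an LLM relevance call: a keyword match on the summary."""
    visited.append(node.title)
    return "revenue" in node.summary


tree = Node("Report", "finance revenue costs methods survey sampling", [
    Node("Finance", "finance revenue costs", [
        Node("Revenue", "revenue grew"),
        Node("Costs", "costs fell"),
    ]),
    Node("Methods", "methods survey sampling", [
        Node("Survey", "survey design"),
        Node("Sampling", "sampling plan"),
    ]),
])

hits = traverse(tree, judge)
print([n.title for n in hits])  # ['Revenue']
print(visited)                  # ['Report', 'Finance', 'Revenue', 'Costs', 'Methods']
```

In this toy tree the judge is consulted for five of the seven nodes: once "Methods" is rejected, its two children are never examined, which is where the token saving would come from when each judgement is an LLM call.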
Extracts cross-chunk semantic relationships as hyperedges. FAISS-based retrieval finds the most relevant hyperedges for relation-aware filtering.
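A small sketch of relation-aware retrieval over hyperedges is given below. It rests on several assumptions: the `Hyperedge` schema, the toy bag-of-words `embed` function, and the sample relations are hypothetical, and a brute-force inner-product search in plain Python stands in for the FAISS index so that the example has no dependencies.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Hyperedge:
    """A relation connecting two or more chunks (hypothetical schema)."""
    relation: str
    chunk_ids: Tuple[str, ...]


VOCAB = ["revenue", "cost", "supplier", "contract", "risk"]


def embed(text: str) -> List[float]:
    """Toy L2-normalised bag-of-words vector; a real system would use a learned encoder."""
    words = text.lower().split()
    vec = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def top_k(query: str, edges: List[Hyperedge], k: int) -> List[Hyperedge]:
    """Rank hyperedges by inner product with the query (stand-in for a vector index)."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(e.relation))), e) for e in edges]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:k]]


edges = [
    Hyperedge("supplier contract raises cost", ("c1", "c4", "c7")),
    Hyperedge("revenue depends on contract", ("c2", "c4")),
    Hyperedge("risk from cost", ("c3", "c7")),
]

best = top_k("supplier contract", edges, k=2)
# A hyperedge pulls in every chunk it connects, not only the best-matching one.
chunks = sorted({cid for e in best for cid in e.chunk_ids})
print([e.relation for e in best])
print(chunks)  # ['c1', 'c2', 'c4', 'c7']
```

The point of the hyperedge representation in this sketch is the last step: a single retrieved relation brings back all the chunks it spans, so evidence spread across distant parts of a document is collected together.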
Aggregates evidence from both strategies. Fusing structural and relational signals achieves higher precision and recall than either alone.
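The page does not say how the two signals are combined. One simple possibility, shown here purely as an assumption, is a weighted sum of per-document scores followed by a threshold; the `fuse` function, the `alpha` weight, the threshold, and the sample scores are all hypothetical.

```python
from typing import Dict


def fuse(tree_scores: Dict[str, float],
         edge_scores: Dict[str, float],
         alpha: float = 0.5) -> Dict[str, float]:
    """Weighted sum of structural and relational scores; a missing score counts as 0."""
    docs = set(tree_scores) | set(edge_scores)
    return {
        d: alpha * tree_scores.get(d, 0.0) + (1 - alpha) * edge_scores.get(d, 0.0)
        for d in docs
    }


# Made-up scores for three documents from the two strategies.
tree_scores = {"a.pdf": 1.0, "b.pdf": 0.5}
edge_scores = {"b.pdf": 1.0, "c.pdf": 0.5}

fused = fuse(tree_scores, edge_scores)
matched = sorted(d for d, s in fused.items() if s >= 0.5)
print(matched)  # ['a.pdf', 'b.pdf']
```

Here `b.pdf` ranks highest because both strategies support it, while `c.pdf`, backed only weakly by one strategy, falls below the threshold.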
Tree traversal prunes irrelevant subtrees early, significantly reducing LLM token consumption compared to naive full-document retrieval.
After filtering, run retrieval-augmented Q&A over the matched documents, reusing the same indexed embeddings.
Upload an entire collection of documents. The system processes, indexes, and evaluates the filter condition across all documents in a single query.
From document ingestion to semantic filtering and RAG-powered analysis.
End-to-end flow from PDF upload through parsing, indexing, and filtering to RAG Q&A output.
Document tree building, PC-KMeans clustering, and hyperedge extraction pipeline.
Two-column interface with document management on the left and filtering/chat on the right.
Progressive filtering pipeline and system component stack.
Watch a full walkthrough of the system in action.
Step-by-step instructions for uploading, filtering, and querying.
Interactive walkthroughs of filtering and Q&A scenarios.