Date on Master's Thesis/Doctoral Dissertation


Document Type

Doctoral Dissertation

Degree Name

Ph. D.


Department

Computer Engineering and Computer Science

Degree Program

Computer Science and Engineering, PhD

Committee Chair

Badia, Antonio

Committee Co-Chair (if applicable)

Khalefa, Mohamed

Committee Member

Khalefa, Mohamed

Committee Member

Altiparmak, Nihat

Committee Member

Frigui, Hichem

Author's Keywords

NoSQL; document stores; database schema; JSON


Query optimization in document stores has traditionally relied on rule-based approaches, but recent research advocates a shift towards cost-based optimization. This transition is hindered by the fragmentation of existing approaches, which reflects the early stage of development of cost-based query optimization for document databases. A key challenge is the absence of a standardized query language and semantics, exacerbated by the diverse and schema-less nature of JSON document collections. To address these challenges, the literature has proposed dynamic schemas, which are used primarily at parsing time. However, these schemas lack a formal foundation that gives them semantics meaningful for query optimization. This thesis proposes a novel framework based on relational-like plans, in which an algebra represents queries internally. Manipulating algebra expressions generates multiple plans, which are then evaluated for cost. The thesis introduces a document algebra designed specifically to accommodate the characteristics of JSON data. It also formalizes a dynamic schema concept termed Data Pilot, inspired by XML DataGuides, and presents an algebra over Data Pilots that supports cardinality estimation without executing the operations themselves. Furthermore, the thesis proposes a strategy for determining when query rewriting based on properties of the document algebra is likely to be advantageous. Experimental validation demonstrates the feasibility of the proposed framework and showcases the construction of Data Pilot structures. This research takes a step towards standardized, cost-based query optimization in document stores, paving the way for more efficient and scalable query processing.
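
To make the idea of estimating cardinality from a schema summary rather than from the data more concrete, the following is a minimal sketch of a DataGuide-style path summary over a JSON collection. It is not the Data Pilot structure or the algebra defined in the thesis; the function names (`document_paths`, `build_summary`, `estimate_exists`, `estimate_both`), the sample documents, and the independence assumption used for conjunctions are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def document_paths(doc):
    """Return the set of distinct dotted paths occurring in one JSON document."""
    seen = set()

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                child = f"{path}.{key}" if path else key
                seen.add(child)
                walk(value, child)
        elif isinstance(node, list):
            # Assumption: array elements are summarized under the array's path.
            for item in node:
                walk(item, path)

    walk(doc, "")
    return seen


def build_summary(collection):
    """Count, for every path, how many documents contain it."""
    counts = Counter()
    for doc in collection:
        counts.update(document_paths(doc))
    return {"total": len(collection), "paths": counts}


def estimate_exists(summary, path):
    """Cardinality of 'documents where path exists', read from the summary."""
    return summary["paths"].get(path, 0)


def estimate_both(summary, path_a, path_b):
    """Estimate 'path_a exists AND path_b exists'.

    Assumption: the two paths occur independently of each other.
    """
    total = summary["total"]
    if total == 0:
        return 0.0
    p_a = estimate_exists(summary, path_a) / total
    p_b = estimate_exists(summary, path_b) / total
    return total * p_a * p_b


# Hypothetical sample collection with heterogeneous structure.
docs = [
    {"name": "a", "address": {"city": "X"}},
    {"name": "b", "tags": ["t1", "t2"]},
    {"name": "c", "address": {"city": "Y", "zip": "1"}, "tags": []},
]

summary = build_summary(docs)
print(estimate_exists(summary, "address.city"))
print(estimate_both(summary, "address", "tags"))
```

Once the summary is built, both estimates are computed without touching the documents again, which is the property a cost-based optimizer needs when comparing candidate plans. The conjunction estimate is only as good as the independence assumption; on the sample data it gives about 1.33 while the true count is 1.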