Napa: Google’s Internal Data Warehouse for Advertising Clients
Napa is Google’s internal data warehouse and part of the infrastructure behind Google Ads. It serves billions of reporting queries every day, which advertising clients use to measure campaign performance. These queries read tables stored in Napa whose records describe ad performance and are keyed by customer and campaign identifiers.
The challenge with processing these queries is that the data is skewed, meaning some queries require millions of records while others need only a few. To address this, we’ve developed a progressive query partitioning algorithm that parallelizes query execution and meets strict latency targets.
When a client submits a reporting query, the main challenge is deciding how to parallelize it effectively. Napa’s parallelization technique breaks the query into sections that are distributed across the available machines, which reduces query latency. However, the number of records associated with a key can only be estimated, and an inaccurate estimate leaves some machines with far more work than others, causing runtime skew and poor performance.
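To make the skew problem concrete, here is a minimal sketch, not Napa's code. It assumes a hypothetical table with eight keys where one key holds almost all of the records, and it splits the key space into equal-width ranges, which is what a partitioner would do if it assumed every key had the same number of records. The function names and the data are illustrative assumptions.

```python
from typing import List


def split_by_key_range(num_keys: int, workers: int) -> List[range]:
    """Split keys 0..num_keys-1 into `workers` contiguous, equal-width ranges."""
    bounds = [num_keys * i // workers for i in range(workers + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(workers)]


def load_per_worker(records_per_key: List[int], parts: List[range]) -> List[int]:
    """Number of records each worker actually has to read."""
    return [sum(records_per_key[k] for k in part) for part in parts]


def skew(loads: List[int]) -> float:
    """Ratio of the busiest worker's load to the average load (1.0 is ideal)."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical data: key 0 is one very large customer, the rest are small.
records_per_key = [1_000_000] + [10] * 7
parts = split_by_key_range(len(records_per_key), workers=4)
loads = load_per_worker(records_per_key, parts)  # [1000010, 20, 20, 20]
print(loads, skew(loads))
```

With four workers the busiest one reads nearly all of the data, so the query is barely faster than running it on a single machine.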
To manage the data deluge in Napa, we use log-structured merge forests (LSM forests) to organize table updates. The LSM structure lets us update tables separately from query serving: each new batch of ingested data is prepared off to the side and then made visible to queries atomically once it is ready.
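The idea of decoupling updates from query serving can be sketched as follows. This is a toy model under stated assumptions, not Napa's storage layer: the class `LsmTable`, its methods, and the choice to aggregate values by summing are all hypothetical. Each ingested batch becomes an immutable sorted run, and publishing it is a single reference swap, so a query that already took a snapshot never sees a half-applied batch.

```python
import threading
from typing import Dict, Optional, Tuple

# An immutable sorted run of (key, value) pairs (hypothetical layout).
Delta = Tuple[Tuple[str, int], ...]


class LsmTable:
    def __init__(self) -> None:
        self._deltas: Tuple[Delta, ...] = ()
        self._lock = threading.Lock()  # serializes writers only

    def publish(self, batch: Dict[str, int]) -> None:
        """Build the new run off to the side, then make it visible in one step."""
        new_delta: Delta = tuple(sorted(batch.items()))
        with self._lock:
            # Readers see either the old tuple or the new one, never a mix.
            self._deltas = self._deltas + (new_delta,)

    def snapshot(self) -> Tuple[Delta, ...]:
        """A consistent view that later publishes cannot change."""
        return self._deltas

    def query_sum(self, key: str, snapshot: Optional[Tuple[Delta, ...]] = None) -> int:
        """Sum the values for `key` across all runs in the snapshot."""
        deltas = self.snapshot() if snapshot is None else snapshot
        return sum(v for delta in deltas for k, v in delta if k == key)


table = LsmTable()
table.publish({"campaign_a": 1, "campaign_b": 2})
view = table.snapshot()            # a query starts here
table.publish({"campaign_a": 5})   # ingest continues in the background
print(table.query_sum("campaign_a", view), table.query_sum("campaign_a"))
```

The query that holds `view` keeps reading the state as of its start, while new queries pick up the latest batch.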
The data partitioning problem in Napa is how to split a very large table, represented as an LSM tree, into pieces of similar size. Our progressive partitioning algorithm splits the table into approximately equal parts based on estimates, and each traversal step reduces the error margin of those estimates. The algorithm stops as soon as the desired error margin is reached, at which point the pieces are guaranteed to be approximately equal.
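One way to picture this kind of refinement is the sketch below. It is an illustration under assumptions, not the algorithm as implemented in Napa: it uses a hypothetical binary tree in which every node stores the record count of its subtree, repeatedly expands the largest unexpanded node, and stops once no piece is larger than `max_error` times the target partition size. The names `Node`, `build_tree`, `refine`, and `cut` are invented for the example.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    lo: int       # smallest key in the subtree
    hi: int       # largest key in the subtree (inclusive)
    count: int    # stored statistic: records in the subtree
    children: List["Node"] = field(default_factory=list)


def build_tree(counts: List[int], lo: int, hi: int) -> Node:
    """Binary tree over per-key record counts (a stand-in for the real index)."""
    if lo == hi:
        return Node(lo, hi, counts[lo])
    mid = (lo + hi) // 2
    left, right = build_tree(counts, lo, mid), build_tree(counts, mid + 1, hi)
    return Node(lo, hi, left.count + right.count, [left, right])


def refine(root: Node, num_parts: int, max_error: float) -> List[Node]:
    """Expand the largest piece until every piece is <= max_error * target."""
    target = root.count / num_parts
    tie = itertools.count()  # tie-breaker so the heap never compares Nodes
    heap = [(-root.count, next(tie), root)]
    leaves: List[Node] = []
    while heap and -heap[0][0] > max_error * target:
        _, _, node = heapq.heappop(heap)
        if not node.children:
            leaves.append(node)  # a leaf cannot be refined further
        for child in node.children:
            heapq.heappush(heap, (-child.count, next(tie), child))
    return sorted(leaves + [n for _, _, n in heap], key=lambda n: n.lo)


def cut(pieces: List[Node], num_parts: int) -> List[List[Node]]:
    """Assign key-ordered pieces to contiguous partitions of similar size."""
    total = sum(p.count for p in pieces)
    parts: List[List[Node]] = [[] for _ in range(num_parts)]
    running = 0
    for p in pieces:
        # Place each piece by the midpoint of its cumulative-count interval.
        idx = min(num_parts - 1, int((running + p.count / 2) * num_parts / total))
        parts[idx].append(p)
        running += p.count
    return parts


# Hypothetical skewed table: four heavy keys followed by 28 light ones.
counts = [8] * 4 + [1] * 28
root = build_tree(counts, 0, len(counts) - 1)
pieces = refine(root, num_parts=4, max_error=0.6)
sizes = [sum(p.count for p in part) for part in cut(pieces, 4)]
print(sizes)  # [16, 16, 12, 16]
```

In this sketch, when every piece is at most `max_error * target` records, each partition holds fewer than `(1 + max_error) * target` records, so a tighter error margin costs more traversal steps but yields more even pieces.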
The progressive partitioning algorithm uses statistics stored with each node of the tree to guide its moves and to reduce the estimation error quickly. It is well suited to our use case because the longer it runs, the more equal the pieces become. In addition, the algorithm can be stopped at any point and still provides good partitioning quality.
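The role of per-node statistics, and the property that stopping early still gives a usable answer, can be illustrated with a small sketch. It is not taken from Napa: the tree layout, the function `count_bounds`, and the `depth_budget` parameter are assumptions made for the example. The function brackets the number of records in a key range using only the counts stored on the nodes it is allowed to visit, so the bounds are valid at every budget and tighten as the traversal goes deeper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    lo: int       # smallest key in the subtree
    hi: int       # largest key in the subtree (inclusive)
    count: int    # stored statistic: records in the subtree
    children: List["Node"] = field(default_factory=list)


def build_tree(counts: List[int], lo: int, hi: int) -> Node:
    """Binary tree whose nodes carry the record count of their subtree."""
    if lo == hi:
        return Node(lo, hi, counts[lo])
    mid = (lo + hi) // 2
    left, right = build_tree(counts, lo, mid), build_tree(counts, mid + 1, hi)
    return Node(lo, hi, left.count + right.count, [left, right])


def count_bounds(node: Node, lo: int, hi: int, depth_budget: int) -> Tuple[int, int]:
    """(lower, upper) bound on records with keys in [lo, hi]."""
    if node.hi < lo or node.lo > hi:
        return (0, 0)                      # no overlap with the range
    if lo <= node.lo and node.hi <= hi:
        return (node.count, node.count)    # fully covered: statistic is exact
    if depth_budget == 0 or not node.children:
        return (0, node.count)             # partial overlap, not resolved yet
    lows, highs = zip(*(count_bounds(c, lo, hi, depth_budget - 1)
                        for c in node.children))
    return (sum(lows), sum(highs))


# Hypothetical skewed table: four heavy keys followed by 28 light ones.
counts = [8] * 4 + [1] * 28
root = build_tree(counts, 0, len(counts) - 1)
exact = sum(counts[2:10])  # keys 2..9 inclusive
bounds = [count_bounds(root, 2, 9, budget) for budget in range(5)]
print(exact, bounds)
```

Every budget returns an interval that contains the exact count, and the interval never widens as the budget grows, which mirrors the behavior described above: more steps give a better estimate, and an early stop still gives a valid one.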
Compared to prior work that partitions using sampled tables, our tree-based method partitions more efficiently. Progressive partitioning is a crucial component of Napa’s ability to serve billions of queries every day for the Google Ads infrastructure.
In conclusion, Napa is Google’s internal data warehouse that handles billions of reporting queries for advertising clients. Our progressive partitioning algorithm keeps query execution efficient and latency low even when the data is skewed, which lets the Google Ads infrastructure serve these queries and provide valuable insights to advertising clients.