The founding engineer and architect of the Uber data team has launched a new system: Onehouse.
Vinoth Chandar, the Uber engineer and architect behind one of the largest data lakes in the world, is the founder of Onehouse, a new project that has just announced its cloud-native managed service. It is built on the open-source data management framework Apache Hudi. The new lakehouse aims to make data lakes easier, faster, and more cost-effective to operate.
The digital age has brought with it a need for innovation, and data has become a driving force in this area. However, many businesses still struggle to build and maintain data infrastructure that scales economically with the rapid growth of their data. The steady rise of AI and machine learning workloads compounds the problem, pushing data volumes and costs beyond what a company's warehouse can handle. Companies are then forced onto a data lake, which brings its own set of complex challenges.
Chandar knows these challenges first-hand from his work at Uber. As the company scaled up and went global, it needed larger and faster data resources to power the app's ETAs, food recommendations, and ride-safety capabilities. Chandar created Apache Hudi to enable a path-breaking architecture that added core warehouse and database functionality directly to the data lake, an architecture that later became known as the lakehouse.
Apache Hudi has become an industry-proven project, supporting business-critical workloads at several of the world's biggest companies, including Amazon, Walmart, Disney+ Hotstar, GE Aviation, Robinhood, and TikTok. The open-source project also sees nearly one million downloads a month.
Built on Apache Hudi, Onehouse offers a cutting-edge data lakehouse with advanced indexes, streaming ingestion services, and data clustering/optimisation techniques.
Uber's Zheng Shao and Mohammad Islam shared that the Hudi project was first created in 2016 and submitted to the Apache Software Foundation's Apache Incubator in 2019. The project has since graduated to a Top-Level Project, and most of Uber's big data on HDFS is now stored in Hudi format. This has given Uber a cost-effective solution and a significant reduction in computing capacity needs.
The skills required to design technology like this are hard to come by, and getting it wrong or taking too long risks stale data and unreliable lake performance.
Onehouse founder and CEO Chandar says, “While a warehouse can just be used, a lakehouse still needs to be built. Having worked with many organisations on that journey for four years in the Apache Hudi community, we believe Onehouse will enable easy adoption of data lakes and future-proof the data architecture for machine learning/data science down the line.”
Chandar also comments that Onehouse makes the adoption of the lakehouse framework smooth and straightforward by providing a fully-managed cloud-native service that can swiftly ingest, self-manage and auto-optimise data.
“Instead of creating yet another vertically integrated data and query stack, [Onehouse] provides one interoperable and truly open data layer that accelerates workloads across all popular data lake query engines like Apache Spark, Trino, Presto and even cloud warehouses as external tables,” he adds.
Onehouse processes data incrementally, handling only new and changed records as they arrive rather than reprocessing entire datasets the way traditional batch methods do. With its combination of state-of-the-art technology and a fully-managed, user-friendly service, businesses can build data lakes in minutes, a process that may have previously taken months. This means cost savings and the ability for a company to continue to own their data in open formats rather than being locked into individual vendors.
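The difference between incremental and batch processing can be sketched in plain Python. This is a hypothetical illustration, not the Hudi API: it mimics how Hudi tags each write with a commit timestamp so a downstream job can pull only records committed after its last checkpoint, instead of rescanning the whole table on every run.

```python
# Hypothetical sketch of incremental pull vs. full batch scan.
# The record layout and function names here are illustrative, not Hudi's API.

records = [
    {"id": 1, "commit_ts": 100, "fare": 12.5},
    {"id": 2, "commit_ts": 100, "fare": 7.0},
    {"id": 3, "commit_ts": 200, "fare": 9.9},
]

def batch_scan(table):
    """Classic batch job: reprocess every record on every run."""
    return list(table)

def incremental_pull(table, since_ts):
    """Incremental job: fetch only records committed after the checkpoint."""
    return [r for r in table if r["commit_ts"] > since_ts]

# The first run processes everything; later runs touch only new commits.
full = batch_scan(records)            # 3 records, every time
delta = incremental_pull(records, since_ts=100)  # just the new commit
```

As the table grows, the batch job's cost grows with total table size, while the incremental job's cost tracks only the volume of new changes, which is where the compute savings come from.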
Co-led by Greylock and Addition, Onehouse raised US$8m in seed funding, which the company plans to spend on the lakehouse product and furthering research and development on Apache Hudi.
Greylock partner Jerry Chen says that Chandar's creation represents the future of data lake systems and offers an easy-to-use data warehouse at the cost and scale of a data lake, which greatly benefits customers.
“Apache Hudi is already the de facto starting point for modern data lakes and today Onehouse makes data lakes easily accessible and usable by all customers,” he says.