Sketching Big Data: Applying Probabilistic Data Structures and Algorithms

IS Button

Name: Derek Alejandro Hinojosa
Major: Computer Science
Advisor: Kowshik Bhowmik

Digital information has seen explosive growth in recent decades with the expansion of digital devices. The growing network of daily digital users generates a massive influx of data. Sizable amounts can be computationally expensive to store, sort, and query and it becomes infeasible to apply brute-force techniques to perform these operations. Sketching algorithms offer robust solutions to such problems in big data. They process data in constant time only using sub-linear memory. The parent field of such algorithms is probabilistic data structures and algorithms. This advanced class of DSA efficiently estimates results to queries within mathematically provable bounds of error. With the recent development of sophisticated and well-optimized sketching techniques, the error is minimal and often within one percent of expected results.

While these are methods that are commonly integrated into “big data environments,” their efficiency and accuracy are indiscriminate of the size and type of data. For this project, tests with sketching algorithms are performed to examine their accuracy and efficiency in answering queries asked about the data such as “what are the most frequent items?” and “how many are unique?”. Though these questions may seem simple to answer, their solutions become more complicated with larger amounts of data and limited memory. The properties of sketching algorithms are at the base of computer science: can a solution be developed to a problem with a minimal footprint? This project is an exploration of sketching algorithms, particularly two common ones: Count-Min Sketch and HyperLogLog.

Posted in Comments Enabled, Independent Study, Symposium 2022.

One response to “Sketching Big Data: Applying Probabilistic Data Structures and Algorithms”

  1. Hallie says:

    Very cool, Derek! Far beyond my area of understanding, but it looks like a great project. Congrats on finIShing!