Earlier this month, I submitted a solution to a toy problem for the Insight Data Engineering Fellowship. The problem was to write a scalable program that generates statistics from large CSV files containing Instacart product sales information. More about the problem can be found in this GitHub repository.
The dataset contained a file named order_products.csv with around 3 million Instacart sales orders, and a file named products.csv with detailed information about each product. The challenge was to use the product-department relationship given in products.csv to map the sales data in order_products.csv to product departments/categories, and then aggregate the sales stats by category.
I tried to solve the problem using two different approaches. The first approach uses a single-threaded process and is memory efficient, but takes a long time to finish. The second approach uses multiple processes running in parallel and is time efficient, but assumes that the system has enough free space in secondary storage. Both approaches read the data row by row so that the solution scales to large inputs, and both use Python dictionaries as the underlying data structure to store the relationship maps and the resulting statistics for fast access. The first approach is implemented in the DeptOrderStat class. The second approach is implemented in the DeptOrderStatMP class, which extends the DeptOrderStat class.
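To make the row-by-row, dictionary-based idea concrete, here is a minimal sketch of the sequential approach. It is not the code from the DeptOrderStat class; the function names are hypothetical, and the column names product_id and department_id are assumptions about the CSV headers.

```python
import csv
from collections import defaultdict


def load_product_departments(products_path):
    """Build a product_id -> department_id dictionary, one row at a time.

    Assumes the file has a header row with product_id and department_id columns.
    """
    product_to_dept = {}
    with open(products_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            product_to_dept[row["product_id"]] = row["department_id"]
    return product_to_dept


def count_orders_by_department(orders_path, product_to_dept):
    """Stream the order file and count order rows per department."""
    counts = defaultdict(int)
    with open(orders_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            dept = product_to_dept.get(row["product_id"])
            if dept is not None:  # skip products missing from products.csv
                counts[dept] += 1
    return dict(counts)
```

Only the product map and the per-department counters stay in memory, so memory use depends on the number of products and departments rather than on the number of orders.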
The assumption behind the parallel solution is that the machine has enough free space to store some temporary files (roughly equal in total to the size of the input files). The parallel solution splits the order file into separate files, processes each file in a separate process, and then consolidates the results. For large input files, this can reduce the time required to generate the report by up to a factor of about m, where m is the number of CPUs in the system; the splitting and consolidation steps add some overhead on top of that.
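The split, process, and consolidate steps could be sketched roughly as follows. This is an illustration under assumptions, not the DeptOrderStatMP implementation: all function names are hypothetical, rows are distributed round-robin (the original may split differently), and the order file is assumed to have a header row with a product_id column.

```python
import csv
import os
import tempfile
from collections import Counter
from multiprocessing import Pool


def split_csv(path, n_parts, tmp_dir):
    """Split a CSV round-robin into n_parts chunk files, each with the header."""
    chunk_paths = [os.path.join(tmp_dir, f"chunk_{i}.csv") for i in range(n_parts)]
    files = [open(p, "w", newline="", encoding="utf-8") for p in chunk_paths]
    try:
        with open(path, newline="", encoding="utf-8") as src:
            reader = csv.reader(src)
            header = next(reader)  # assumes a header row is present
            writers = [csv.writer(f) for f in files]
            for writer in writers:
                writer.writerow(header)
            for i, row in enumerate(reader):
                writers[i % n_parts].writerow(row)
    finally:
        for f in files:
            f.close()
    return chunk_paths


def count_chunk(args):
    """Worker: count order rows per department in one chunk file."""
    chunk_path, product_to_dept = args
    counts = Counter()
    with open(chunk_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            dept = product_to_dept.get(row["product_id"])
            if dept is not None:
                counts[dept] += 1
    return counts


def merge_counts(partials):
    """Consolidate the per-chunk counters into one dictionary."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


def count_orders_parallel(orders_path, product_to_dept, n_procs=None):
    """Split the order file, count each chunk in its own process, then merge."""
    n_procs = n_procs or os.cpu_count() or 1
    with tempfile.TemporaryDirectory() as tmp_dir:
        chunks = split_csv(orders_path, n_procs, tmp_dir)
        with Pool(n_procs) as pool:
            partials = pool.map(count_chunk, [(c, product_to_dept) for c in chunks])
    return merge_counts(partials)
```

When calling count_orders_parallel from a script, it is safest to do so under an if __name__ == "__main__": guard, since some platforms start worker processes by re-importing the main module. The temporary chunk files are where the extra disk space goes: together they hold one full copy of the order file.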
The interface is the same for both classes. To run the parallel solution, just pass --mp after the script name.
To run the parallel solution, use: python3 report_generator.py --mp
To run the sequential solution, use: python3 report_generator.py
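For readers curious how such a switch might be wired up, here is a small sketch using argparse from the standard library. It is an assumption about how the flag could be handled, not the actual contents of report_generator.py; build_parser and select_class_name are hypothetical helpers that only return the name of the class to use.

```python
import argparse


def build_parser():
    """Create a parser with a single optional --mp switch."""
    parser = argparse.ArgumentParser(
        description="Generate department-level order statistics."
    )
    parser.add_argument(
        "--mp",
        action="store_true",
        help="use the multi-process (parallel) implementation",
    )
    return parser


def select_class_name(argv=None):
    """Return the name of the class that would handle this invocation."""
    args = build_parser().parse_args(argv)
    return "DeptOrderStatMP" if args.mp else "DeptOrderStat"
```

Because both classes share the same interface, the rest of the script would not need to know which one was selected.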
You can find the solution code at my GitHub repository.