The Dataset Relay—Crowdsourcing Food Informatics Data
The Data Void
The Decaying Cycle needs a high-quality dataset of Indian packaged foods to even begin training. No such dataset exists. Open Food Facts India has 10k+ entries, but coverage is spotty, ingredients are unnormalized, and nutrition tables are incomplete.
Open Food Facts. (2024). *Open Food Facts India database reaches 10K product milestone*.
Static scrapes hallucinate values or miss the messy reality of labels. Manufacturers list “edible vegetable oil” without specifying which oil. Sodium might be listed as “0” or be absent altogether. Nothing tracks provenance: which data came from which source.
The Relay Hypothesis
What if we treat dataset-building as a relay race, not a solo sprint?
Single-track datathon on Kaggle. Teams submit two things:
- Clean dataset following minimal required schema:
product_id, brand, product_name, ingredients_raw, ingredients_canonical, core FSSAI nutrients per 100 g (energy, protein, fat, carbs, sugars, sodium), source_url, collection_date, ai_assisted_flags.
The schema is hybrid: fixed core fields ensure that everything merges cleanly, while teams can add whatever extra columns they want. Open Food Facts. (2022). Taxonomies introduction.
- Research log notebook documenting day-by-day: sources tried, standardization approaches, dead ends encountered, quality checks implemented.
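As a rough illustration of how the hybrid schema could be checked at submission time, here is a minimal sketch using only the Python standard library. The six nutrient column names (`energy_100g` and so on), the helper names, and the rule that a blank nutrient cell means "not printed on the label" are all assumptions made for this example; the schema above names the nutrients but not their column names or a missing-value convention.

```python
import csv
import io

# Core fields named in the minimal schema.
CORE_FIELDS = [
    "product_id", "brand", "product_name",
    "ingredients_raw", "ingredients_canonical",
    "source_url", "collection_date", "ai_assisted_flags",
]
# Hypothetical column names for the six per-100 g nutrients.
NUTRIENT_FIELDS = [
    "energy_100g", "protein_100g", "fat_100g",
    "carbs_100g", "sugars_100g", "sodium_100g",
]
REQUIRED = CORE_FIELDS + NUTRIENT_FIELDS


def check_header(header):
    """Return (missing required columns, extra team-specific columns)."""
    missing = [f for f in REQUIRED if f not in header]
    extra = [c for c in header if c not in REQUIRED]
    return missing, extra


def check_row(row):
    """Return a list of problems found in one row (a dict of strings)."""
    problems = []
    # Identity and provenance fields must never be empty.
    for f in ("product_id", "source_url", "collection_date"):
        if not (row.get(f) or "").strip():
            problems.append(f"empty {f}")
    for f in NUTRIENT_FIELDS:
        value = (row.get(f) or "").strip()
        if value == "":
            continue  # assumed convention: blank = absent on label, distinct from 0
        try:
            if float(value) < 0:
                problems.append(f"negative {f}")
        except ValueError:
            problems.append(f"non-numeric {f}")
    return problems


def validate_csv(text):
    """Validate CSV text; row keys are file line numbers (header is line 1)."""
    reader = csv.DictReader(io.StringIO(text))
    missing, extra = check_header(reader.fieldnames or [])
    row_problems = {}
    # Line numbers assume no cell contains an embedded newline.
    for line_no, row in enumerate(reader, start=2):
        problems = check_row(row)
        if problems:
            row_problems[line_no] = problems
    return {"missing": missing, "extra": extra, "rows": row_problems}
```

Extra columns are reported rather than rejected, which matches the idea that teams may extend the schema freely as long as the core fields are present.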
Judging evaluates final submissions holistically, with strategy transparency and data rigor carrying the highest weight. The top pipelines are merged into one canonical dataset. All teams get co-authorship on the research paper documenting the merged dataset and collection methods.
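One possible way to merge the top pipelines while keeping provenance is sketched below. The merge rule is an assumption for illustration, not a decided design: submissions are processed in rank order, the first non-empty value for each field wins, and the contributing team is recorded per field. It also assumes `product_id` is comparable across teams (for example a barcode), which the schema does not yet guarantee. The function name `merge_submissions` is hypothetical.

```python
def merge_submissions(ranked_submissions):
    """Merge rows from several teams into one record per product.

    ranked_submissions: list of (team_name, rows), best-ranked team first,
    where rows is a list of dicts that each contain "product_id".
    Returns {product_id: {"fields": {...}, "provenance": {field: team_name}}}.
    """
    merged = {}
    for team, rows in ranked_submissions:
        for row in rows:
            pid = row["product_id"]
            entry = merged.setdefault(pid, {"fields": {}, "provenance": {}})
            for field, value in row.items():
                if value in (None, ""):
                    continue  # empty cells never overwrite or claim a field
                if field not in entry["fields"]:
                    # First non-empty value wins (assumed rule); remember its source.
                    entry["fields"][field] = value
                    entry["provenance"][field] = team
    return merged
```

With this rule a lower-ranked team can still fill gaps left by a higher-ranked one, and every value in the canonical dataset can be traced back to the team that supplied it.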
Why This Might Work
Teams learn real-world data wrangling: scraping messy labels, standardizing ingredient names, designing quality checks. There are CV signals too: named co-authorship on a research paper and a published dataset. No endorsers, prize committees, or complex logistics are needed: just run the datathon and publish the merged output.
The design learns from citizen science: simple capture tools plus explicit validation loops scale better than pure automation. Engelhard, C. L., et al. (2023). Citizen science approaches to crowdsourcing food environment data: A scoping review of the literature. Obesity Reviews.
The format forces provenance tracking per row, breaking the opaque "trust me" pattern that plagues nutrition datasets on Kaggle.
Batthula, V. (2025). Indian Food Nutritional Values Dataset.
One team might discover a stupid-simple trick (barcode lookup first? photo-OCR with custom heuristics?) that unlocks what solo research misses.
Status: Datathon Design
Next Steps: Draft Kaggle competition page.
My deepest gratitude to Mr. Krishna, whose constancy forms the foundation upon which all my work, including this, quietly rests. Salutations to the Goddess who dwells in all beings in the form of intelligence. I bow to her again and again.