DSI DataFest 2018
September 14 - 21
The DSI is holding a DataFest! We'll provide the problem and the data, and you’ll work in small teams to explore interesting approaches and solutions. UC Davis students, postdocs, and staff at all skill levels are encouraged to participate. Alumni, community members, and other UC and CSU students should contact us to see if we have space.
Sherry Lehmann is a well-known and long-established seller of wine and other beverages. The UC Davis Library has over 200 scanned “wine” catalogs created by Sherry Lehmann, spanning a 30-year period and containing over 4,000 pages. How can we best recover information about wine prices over time from these scans so it can be used by researchers, archivists, wine historians, and enthusiasts? The Data Science Initiative is actively helping with this project, and we challenge you to find creative ways to solve this problem and correctly extract the data from the catalogs.
This data recovery project combines:
- optical character recognition (OCR)
- image processing
- computer vision
- text processing
- various statistical and machine learning methods
- common sense and creative problem solving!
The event kicks off on Friday, September 14th with an info session to frame the problem and form teams. Come and work with your friends, or we'll help you form a team. Teams are expected to have 2 to 4 people.
Group work days are September 18-21 (10am-2pm each day). Your entire team is expected to work in the DSI during those days and times, and we’ll provide you with coffee, tea and snacks to keep you going. But you aren’t restricted to working on the project only during those times - you can stay and work in the DSI or library.
All projects must be completed by noon on September 21, and teams will present their results that afternoon.
SHOULD I PARTICIPATE?
This DataFest is a great opportunity to:
- work on an important, relevant Data Science project
- develop a project for your portfolio to show employers and others
- work in teams and develop teamwork skills
- learn about OCR
- use workflow best practices
- problem solve!
There are many ways to contribute - across different subtasks and at all skill levels - so don't be shy or hesitant to participate. Everyone's welcome and useful. Subtasks include developing truth-sets, testing results, integrating data, identifying patterns in the images, developing ideas for extraction approaches, programming, and fitting models. And there are lots of smaller sub-projects, so "finishing" your project doesn't necessarily mean recovering all the data!
- Friday, Sept. 14th (11:30am - 12:30pm): Introduce the problem and data, answer questions, form teams.
- Tuesday, Sept. 18th - Thursday, Sept. 20th (10am - 2pm): Work in the DSI. You're welcome to work in the DSI before and after this period.
- Friday, Sept. 21st (10am-2pm): Finish the project in the morning. Teams present their results over lunch. Awards will be announced after the products are reviewed by the DSI and Library.
Data Science Initiative classroom (room 360 Shields Library), Davis campus.
Each team must present a short summary of its findings to the other teams and to a panel of judges made up of staff and faculty from the DSI and Library. Prizes are TBD. Teams will be judged on their creativity, insight, and communication. Separate prizes will be awarded for technical solutions and conceptual approaches, so everyone - regardless of incoming skills - has a chance to succeed.