Wednesday, May 16, 2018

Kaggle: reconstruct tracks from 75 GB of point data

Fartel Engelbert has told me that there is a new CERN-sponsored machine learning contest at Kaggle: the TrackML Particle Tracking Challenge.
To make a long story short, the data you will have to download include five 15 GB training files plus a 1 GB training sample and a 1 GB test file. A sample submission has 30 MB, have 175 kB.

Well, readers whose infrastructure is similar to mine have already given up. I don't know what to do with 75 GB. On Windows, it's no trouble to store this much data, but I would have to manipulate it with Mathematica, and that would clearly be too slow with 75 GB.

On the other hand, I could run a VirtualBox with some Linux, like during the Higgs Kaggle contest, but then I would have to study whether I have to allocate some extra hard disk for the simulated Linux hard disk and face similar problems that I am not experienced with. I just don't want to do that – this dataset is simply too big for me.
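For readers who do want to try, a file of this size does not have to be loaded into memory at once. What follows is only a minimal sketch in Python of reading a CSV file row by row and keeping a running tally. It assumes the data come as CSV with a header row; the column names (hit_id, x, y, z, volume_id) and the helper count_rows_per_group are hypothetical and chosen purely for illustration.

```python
import csv
import io
from collections import Counter


def count_rows_per_group(lines, key_column):
    """Stream CSV rows one at a time and count rows per value of key_column.

    `lines` is any iterable of text lines (an open file, a StringIO, ...),
    so only one row is held in memory at a time.
    """
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[key_column]] += 1
    return counts


# Tiny in-memory stand-in for one of the big files. The column names
# and values are made up for illustration; they are an assumption,
# not the contest's documented format.
sample = io.StringIO(
    "hit_id,x,y,z,volume_id\n"
    "1,-64.4,-7.2,-1502.5,7\n"
    "2,-55.3,0.6,-1502.5,7\n"
    "3,12.0,3.1,598.0,8\n"
)

counts = count_rows_per_group(sample, "volume_id")
print(counts)
```

With a file on disk, the same function can be given an open file handle, for example open(path, newline=""), instead of the StringIO object. Memory use then depends on the number of distinct groups rather than on the size of the file, although a full pass over tens of gigabytes will still take a while.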

If such things aren't trouble for you, you should try. In the first phase of the contest – three months are left – you need to design the most accurate algorithm to reconstruct particles' tracks from the points that the huge datasets are composed of.

There will be another part of the contest that starts in the summer where the speed of the calculation will matter.

The leaderboard shows that the top contestant among 222 has a score of just 0.46, so I believe that there's a lot of room for improvement. The preliminary leaderboard is based on some 29% of the data; the final one will be based on the remaining 71%, so it may be different.

Most importantly, the prizes are $12k, $8k, and $5k (in total, $25k) for first, second, and third place.

Good luck.
