My name is Mark Pfeiffer and I'm the Co-Founder and CTO of SiaSearch (https://www.siasearch.io/), a Berlin-based AI startup that provides a data management tool for engineers working on self-driving cars and other computer vision applications.
Our Head of Product Armaghan Khan and I will be hosting an AMA later today, April 8 at 12pm MT. We’re eager to hear any questions you might have about:
🖼 Computer vision model development
🤖 Autonomous vehicles
💾 Data bottlenecks in machine learning
💾 Data selection & curation
Looking forward to an interactive session!
Also, is this tool targeted just at engineers working on computer vision applications or is that just the largest side of the market?
Currently the tool is quite tailored to the computer vision use case. A lot of other ML applications deal with more structured data, which is still challenging but easier to handle and select. For computer vision applications, data selection is particularly hard because the content of images is difficult to access, so a lot of manual screening is required to select the right data. We currently see the largest value add of SiaSearch in this application and focus our team on it!
One thing I've read about with ML/AI is that it is difficult to trace back why a model came up with its final verdict or weights. It can be hard to debug the layers/network to know why, for example, the image of a speed limit sign with tape over part of it made the model determine the number it saw. I imagine users want to know which parameters to tweak (without too much trial and error) to nudge the probabilities in the right direction.
Is this something your software addresses or is this an ongoing challenge in the field?
Also adding an example regarding your question: you might realize that your model has problems detecting traffic lights under sunny conditions while it works well in the dark or rain. This is an interesting insight and tells you that you should probably get more annotated data of sunny intersections with traffic lights. So in summary, improving model performance is a core element of SiaSearch; however, we look at it from an I/O perspective rather than at raw model weights.
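The kind of analysis described above can be sketched in a few lines. This is a minimal illustration (not the SiaSearch API): per-frame evaluation results are tagged with metadata such as weather, and detection performance is sliced by that metadata to surface failure modes. All names and numbers here are made up.

```python
from collections import defaultdict

# Hypothetical per-frame evaluation results, each tagged with metadata.
frames = [
    {"weather": "sunny", "traffic_light_detected": False},
    {"weather": "sunny", "traffic_light_detected": True},
    {"weather": "rain",  "traffic_light_detected": True},
    {"weather": "night", "traffic_light_detected": True},
    {"weather": "sunny", "traffic_light_detected": False},
]

def recall_by_slice(frames, key):
    """Fraction of frames with a successful detection, grouped by a metadata key."""
    hits, totals = defaultdict(int), defaultdict(int)
    for f in frames:
        totals[f[key]] += 1
        hits[f[key]] += int(f["traffic_light_detected"])
    return {k: hits[k] / totals[k] for k in totals}

result = recall_by_slice(frames, "weather")
print(result)  # sunny scores lowest -> collect more annotated sunny data
```

The point is that the failure mode only becomes visible once raw sensor data is enriched with queryable metadata.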
For data-driven development, of which computer vision and self-driving are a subset, there are very few tools so far that make the work of developers simple and easy; we wanted to change that. We envision a future where building data-driven products is as easy as building software today.
And I believe the industry is currently taking on a similar perspective. During a recent conference, Andrew Ng urged developers and companies to take a more data-centric approach to ML (https://scale.com/events/transform/videos/big-data-to-good-data?validation=big-data-to-good-data). One of the big prerequisites for doing so is better tooling, and that is our mission! 🛠 🛠 🛠
1. Intelligent algorithms are applied to extract useful information, e.g. whether the car was making a turn, what the weather was like, and how many people were in view
2. This information (which we call metadata) is populated into a proprietary database that allows super-fast queries on PB-scale data
3. To make it easy for the user, an SDK and a GUI are provided, where they can easily search, select, and visualize data as needed
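The three steps above can be sketched with an in-memory database. This is an illustrative toy, not the real SiaSearch stack or SDK: extracted metadata is stored per frame (steps 1–2), and data selection then becomes a fast query over metadata rather than a scan of raw video (step 3).

```python
import sqlite3

# Step 2: a database holding per-frame metadata (in-memory for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE frames (frame_id TEXT, maneuver TEXT, weather TEXT, num_people INTEGER)"
)

# Step 1: metadata that extraction algorithms would have produced per frame.
conn.executemany(
    "INSERT INTO frames VALUES (?, ?, ?, ?)",
    [
        ("f001", "left_turn", "sunny", 3),
        ("f002", "straight",  "rain",  0),
        ("f003", "left_turn", "rain",  1),
    ],
)

# Step 3: select data by content without touching the underlying video.
rows = conn.execute(
    "SELECT frame_id FROM frames WHERE maneuver = 'left_turn' AND weather = 'rain'"
).fetchall()
print(rows)  # [('f003',)]
```

A real deployment would index PB-scale data, but the selection workflow looks the same from the user's side: describe the situation, get back the matching frames.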
You can dive into more depth at https://www.siasearch.io/product and can also experience the product for yourself here: https://public.sia-search.com/
Would love to know more about how you are able to improve consumer experiences.
The most popular approach to self-checkout technologies involves the use of multiple cameras. Using the video feeds, the self-checkout software stack recognizes inventory and buyers and can associate the two. Naturally, these algorithms need data to be trained, which is where SiaSearch comes in. Using our product, a developer can easily get a subset of situations, e.g. a buyer fetching a yoghurt pack from the refrigerator. They can use this subset to train the right model and improve performance more quickly.
That's super interesting. So in a way, SiaSearch can provide some of the initial annotation itself without the need for human annotators?
If so, I'd see that as a huge value-add. Have you been marketing it as both an automatic annotation platform + data management platform?
1. Without SiaSearch: send all data for human annotation i.e. time and cost intensive
2. With SiaSearch: extract the left turns and only get those portion annotated i.e. faster and cheaper
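The cost difference between the two workflows above can be made concrete with a toy calculation. All numbers here are invented for illustration: 1,000 recorded frames, of which 10% contain left turns, at a hypothetical flat annotation price per frame.

```python
# Synthetic dataset: every 10th frame is a left turn (made-up distribution).
frames = [
    {"id": i, "maneuver": "left_turn" if i % 10 == 0 else "straight"}
    for i in range(1000)
]

COST_PER_FRAME = 0.50  # hypothetical annotation cost in dollars

# 1. Without pre-filtering: send everything to human annotators.
cost_all = len(frames) * COST_PER_FRAME

# 2. With metadata-based pre-filtering: annotate only the left turns.
left_turns = [f for f in frames if f["maneuver"] == "left_turn"]
cost_filtered = len(left_turns) * COST_PER_FRAME

print(cost_all, cost_filtered)  # 500.0 50.0
```

The saving scales with how rare the target situation is: the rarer the event, the more a metadata filter pays off before annotation.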
If not, is that a feature you're looking to add in the future or have you purposefully stayed away from that feature as not to compete with the already existing tools?
Since your company has worked to solve the data filtering problem, how long do you think it'll be before we are able to solve the data annotation problem? When do you think we'll have algorithms that can annotate data as well as humans can? Now that the training data industry has become quite huge, with millions around the world contributing to data annotation projects, I imagine the answer to that question could change the entire industry.