As of writing this (Sept 11, 2018), the Google Dataset Search is still in Beta.
Google has released a tool to search for datasets. It works pretty much like Google Search, except that what you get back are datasets.
Check out the search tool here: Google Dataset Search.
On their blog, Google says that their vision is to make datasets easier to discover. I think anyone who has ever written a scientific paper or a dissertation can agree that finding a dataset, let alone the right one, is a tedious task.
Google already has Google Scholar, a search engine for scientific articles that you can cite, quote or reference in your own work. If an article is peer reviewed or refereed, some quality control is in place, and at that point you can most likely use it as a source to support your arguments.
Now, words are great and all. We have read words for thousands of years and will most likely continue to do so. BUT statistics and visuals usually drive a point home. If you have strong arguments backed by peer-reviewed studies and you can showcase them with actual data, they’re often pretty difficult to argue against.
Who should use the Dataset Search?
“To enable easy access to this data, we launched Dataset Search, so that scientists, data journalists, data geeks, or anyone else can find the data required for their work and their stories, or simply to satisfy their intellectual curiosity.”
I think Google sums it up perfectly. But it can also be used for AI and deep learning. At one of my employers we wanted to see how weather affected daily sales, if at all. We built a robust model, only to find out that the dataset we needed (a very local one) was nowhere to be found. At least not for free. The data available at the time was behind a paywall bigger than the Great Wall of China… Finding a reliable dataset for free is a blessing.
Development of datasets
If Google pulls this off, it would also mean that researchers and data collectors begin to move toward a standardized way of presenting datasets. That means less time spent looking for data and working out how to fit it into a model, and more time spent analyzing it.
I believe this because of what Google has done with their search engine in general: a lot of websites today are modified and developed in accordance with Google’s SEO standards. Google said that they want to “make data discoverable, but keep it where it is”, meaning that they do not want to store all datasets in their own database, but rather be a search engine for them. As a result, anyone who wants their dataset to be seen or used will have to adapt to the Google Dataset Search guidelines.
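For the curious, here is roughly what describing a dataset in a standardized way could look like. As far as I know, Google Dataset Search reads schema.org Dataset markup embedded in the page that hosts the data. The sketch below builds such a description in Python. The dataset, its creator and every value in it are made up for illustration, and the helper name `to_json_ld_script` is my own.

```python
import json

# Hypothetical dataset description; every value below is invented for illustration.
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Daily rainfall in Exampletown",
    "description": "Daily rainfall totals in millimetres, measured at one weather station.",
    # Who created the dataset
    "creator": {"@type": "Organization", "name": "Example Weather Club"},
    # When it was published
    "datePublished": "2018-09-11",
    # How the data was collected
    "measurementTechnique": "Manual rain gauge, read once a day at 08:00",
}


def to_json_ld_script(metadata: dict) -> str:
    """Wrap the metadata in the script tag that a web page would embed."""
    body = json.dumps(metadata, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


snippet = to_json_ld_script(dataset_metadata)
print(snippet)
```

The printed snippet is what a data publisher would paste into the HTML of the page describing the dataset. The data itself stays wherever it already lives.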
What are some pitfalls?
As with all statistics, the data has to be checked. Google says that they have guidelines in place for those who wish to have their datasets listed, asking for details such as who created the dataset, when it was published and how the data was collected.
This is great, but as a user of this data you need to be wary of how it was collected.
A dataset on political opinions collected by a newspaper whose readers are 80% liberal should not be treated as representative of the whole population.
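To put some numbers on the newspaper example, here is a small sketch of how much a skewed sample can move a result. All the figures are hypothetical: I assume the whole population is split 50/50, and that 70% of liberals and 30% of everyone else support some policy.

```python
# All numbers are hypothetical and only illustrate the skew.
sample_share = {"liberal": 0.80, "other": 0.20}      # who answered the newspaper survey
population_share = {"liberal": 0.50, "other": 0.50}  # assumed make-up of the whole population
support_rate = {"liberal": 0.70, "other": 0.30}      # share of each group backing some policy


def weighted_support(shares, rates):
    """Average the per-group support rates, weighted by each group's share."""
    return sum(shares[group] * rates[group] for group in shares)


raw_estimate = weighted_support(sample_share, support_rate)
adjusted_estimate = weighted_support(population_share, support_rate)

print(f"Straight from the survey: {raw_estimate:.0%}")
print(f"Reweighted to the population: {adjusted_estimate:.0%}")
```

With these made-up numbers the survey says 62% support, while the reweighted figure is 50%. Reweighting like this only helps if you know which group each respondent belongs to, and if the respondents are typical of their group, which is exactly the kind of thing the collection notes should tell you.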
Finding more data
I will begin to use Google Dataset Search in my work. But it is also worth mentioning that Reddit has its own community for datasets. You can also visit Kaggle, where you’ll find a lot of datasets as well.
And remember correlation does not equal causality…
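If you want to see that last point in action, here is a tiny sketch with invented numbers. Temperature drives both ice cream sales and sunburn cases, so the two look strongly related even though neither causes the other. Once the effect of temperature is subtracted from both, most of the relationship goes away. The data and the helper names `pearson` and `residuals` are my own and only there for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def residuals(ys, xs):
    """What is left of ys after subtracting a straight-line fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


# Invented data: temperature is the hidden driver of both series.
temperature = [10, 12, 15, 18, 20, 23, 25, 28, 30, 32]
noise_a = [1, -1, 2, -2, 1, -1, 2, -2, 1, -1]
noise_b = [-1, 2, 1, -2, -1, 1, -2, 2, 1, -1]
ice_cream_sales = [3 * t + a for t, a in zip(temperature, noise_a)]
sunburn_cases = [2 * t + b for t, b in zip(temperature, noise_b)]

raw = pearson(ice_cream_sales, sunburn_cases)
controlled = pearson(residuals(ice_cream_sales, temperature),
                     residuals(sunburn_cases, temperature))

print(f"Ice cream vs sunburn: {raw:.2f}")
print(f"Same, with temperature removed: {controlled:.2f}")
```

The first number comes out close to 1 and the second is much closer to 0. Ice cream does not cause sunburn. The weather does both.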