The 10 best platforms to find free datasets and how to tell if they are of good quality

Upload: newsdata_io

Post on 01-Nov-2021

Category:

Software


DESCRIPTION

There is plenty of free data out there, ready to be used for school projects, market research, or just for fun. Before you go crazy, however, you should be aware of the quality of the data you find. Here are some great sources of free data and some ways to determine their quality.

TRANSCRIPT

Page 1: The 10 best platforms to find free datasets and how to tell if they are of good quality

THE 10 BEST PLATFORMS TO FIND FREE DATASETS

www.newsdata.io

Page 2: The 10 best platforms to find free datasets and how to tell if they are of good quality

If “data is the new oil,” then there is a lot of free oil just waiting to be used. And you can do some pretty interesting things with that data, like finding the answer to the question: Is Buffalo, New York really that cold in the winter?

There is plenty of free data out there, ready to be used for school projects, market research, or just for fun. Before you go crazy, however, you should be aware of the quality of the data you find. Here are some great sources of free data and some ways to determine their quality.

All of these dataset sources have strengths, weaknesses, and specialties. All in all, they are great resources, and you can easily spend a lot of time going down rabbit holes.

But if you want to stay focused and find what you need, it’s important to understand the nuances of each source and use their strengths to your advantage.

Newsdata.io API

Page 3: The 10 best platforms to find free datasets and how to tell if they are of good quality

1. Google Dataset Search

As the name suggests, Google Dataset Search is “a dataset search engine” whose primary audience includes journalists and data researchers.

Google Dataset Search has the most datasets of any option listed here, with 25 million datasets available when it exited beta in January 2020. As you would expect from a Google product, the search function is powerful, and if you need to be really specific, it has plenty of filters to narrow down the results.

When it comes to finding free public datasets, you can’t do much better than Google Dataset Search right now. Keep in mind, though, that the “Google Graveyard,” the phenomenon of Google canceling a service or product on short notice, is a pervasive danger to Google products large and small. It is good to know the other options.

Page 4: The 10 best platforms to find free datasets and how to tell if they are of good quality

2. Kaggle

Kaggle is a popular data science competition website that provides free public datasets that you can use to learn more about artificial intelligence (AI) and machine learning (ML).

Organizations use Kaggle to post a challenge (such as cassava leaf disease classification), and teams from around the world compete against each other to solve it using algorithms (and win a cash prize).

Kaggle is quite prominent in the data science community because it provides a way to test and demonstrate your skills: your performance in Kaggle competitions sometimes comes up in job interviews for AI/ML positions.

Page 5: The 10 best platforms to find free datasets and how to tell if they are of good quality

After these competitions, the datasets are made available for use. At the time of writing, Kaggle has a collection of over 68,000 datasets, which it organizes using a system of tags, usability scores, and positive and negative votes.

Kaggle has a strong community on its site, with discussion boards within each dataset and within each competition. There are also active communities outside of Kaggle, such as r/kaggle, which share tips and tutorials.

All of this is to say that Kaggle is more than just a free dataset distributor; it’s also a way to test your skills as a data scientist. Free datasets are a side benefit that anyone can take advantage of.

Page 6: The 10 best platforms to find free datasets and how to tell if they are of good quality

3. GitHub

GitHub is the global standard for collaborative and open-source online code repositories, and many of the projects it hosts have datasets you can use. There is a specific project for public datasets, aptly called Awesome Public Datasets.

Like Kaggle, the datasets available on GitHub are a side benefit of the site’s real purpose. In the case of GitHub, that purpose is primarily hosting code repositories.

GitHub is not a data repository optimized for discovering datasets, so you might need to get a little creative to find what you’re looking for, and it won’t have the same variety as Google or Kaggle.

Page 7: The 10 best platforms to find free datasets and how to tell if they are of good quality

4. Government Sources

Many government agencies make their data freely available online, allowing anyone to download and use public datasets. You can find a wide variety of government data from municipal, state, federal, and international sources.

These datasets are great for students and for those focusing on the environment, the economy, healthcare (there is a lot of this type of data because of COVID-19), or demographics.

Keep in mind that these aren’t the most stylish sites of all time: they are mostly focused on function rather than style.

Page 8: The 10 best platforms to find free datasets and how to tell if they are of good quality

5. FiveThirtyEight

FiveThirtyEight is a data journalism website that occasionally makes its datasets available. Its original focus was sports, but it has since spread to pop culture, science, and (most famously) politics.

The datasets made available by FiveThirtyEight are highly organized and specific to its journalistic output. Unlike the other options on this list, you’ll likely end up browsing the catalog rather than searching.

And you might come across some fun and interesting datasets, like 50 Years of World Cup Doppelgangers.

Page 9: The 10 best platforms to find free datasets and how to tell if they are of good quality

6. Data.world

Data.world is a data catalog service that simplifies collaboration on data projects. Most of these projects make their datasets available free of charge.

Anyone can use data.world to create a workspace or a project that hosts a dataset. A wide variety of data is available, but it is not easy to navigate. You will need to know what you are looking for to get useful results.

Data.world requires a login to access its free community plan, which allows you to create your own projects and datasets and provides access to others’ projects and datasets. You will need to pay to access additional projects, datasets, and repositories.

Page 10: The 10 best platforms to find free datasets and how to tell if they are of good quality

7. Newsdata.io news datasets

Newsdata.io is a news API that collects worldwide news data on a daily basis and makes that data available through the API.

It also provides free news datasets, and best of all, you can build a news dataset to your own requirements using the Newsdata.io news API in Python, although fetching large amounts of data may take longer.
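As a rough illustration of building your own news dataset in Python, the sketch below pages through an API and writes the articles to a CSV file. The endpoint URL, the `apikey`, `q`, and `page` parameters, and the `results` and `nextPage` response fields are assumptions made for this example; check the Newsdata.io documentation for the actual names before relying on them.

```python
import csv
import json
import urllib.parse
import urllib.request

# Assumption: illustrative endpoint; confirm it in the provider's documentation.
BASE_URL = "https://newsdata.io/api/1/news"


def fetch_page(api_key, query, page=None):
    """Fetch one page of results and return the decoded JSON as a dict."""
    params = {"apikey": api_key, "q": query}  # assumed parameter names
    if page:
        params["page"] = page
    url = BASE_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def collect_articles(fetch, max_pages=5):
    """Call fetch(page_token) repeatedly and gather the articles it returns.

    Assumes each response is a dict with a "results" list and an optional
    "nextPage" token (hypothetical field names).
    """
    articles, token = [], None
    for _ in range(max_pages):
        payload = fetch(token)
        articles.extend(payload.get("results") or [])
        token = payload.get("nextPage")
        if not token:
            break
    return articles


def write_csv(articles, path, fields=("title", "link", "pubDate")):
    """Write the chosen fields of each article to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for article in articles:
            writer.writerow({k: article.get(k, "") for k in fields})
```

A call might look like `collect_articles(lambda token: fetch_page(MY_KEY, "climate", token))`, where `MY_KEY` is your own API key. Each page is a separate request, which is why large datasets take longer to fetch.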

Page 11: The 10 best platforms to find free datasets and how to tell if they are of good quality

8. AWS Public Data sets

Amazon makes large datasets available on its Amazon Web Services platform. You can download the data and use it on your computer, or analyze the data in the cloud using EC2 and Hadoop via EMR. You can read more about how the program works on the AWS site.

Amazon has a page that lists all the datasets to browse. You will need an AWS account, although Amazon does provide a free tier for new accounts that will allow you to explore data at no cost.

Page 12: The 10 best platforms to find free datasets and how to tell if they are of good quality

9. Wikipedia

Wikipedia is a free, online, community-edited encyclopedia. Wikipedia contains an astonishing expanse of knowledge, with pages on everything from the Ottoman-Habsburg wars to Leonard Nimoy.

As part of Wikipedia’s commitment to the advancement of knowledge, it offers all of its content free of charge and regularly generates dumps of all articles on the site. In addition, Wikipedia offers a history of changes and activity, which allows you to follow the evolution of a page over time and to see who contributes to it.

You can find different ways to download the data on the Wikipedia site. You will also find scripts to reformat the data in various ways.

Page 13: The 10 best platforms to find free datasets and how to tell if they are of good quality

10. UCI Machine Learning Repository

The UCI Machine Learning Repository is one of the oldest sources of datasets on the web. While the datasets are user-supplied and therefore have varying levels of documentation and cleanliness, the vast majority are clean and ready to use. UCI is a great first stop when looking for interesting datasets.

The data can be downloaded directly from the UCI Machine Learning Repository, without registration. These datasets tend to be quite small and don’t have a lot of nuance, but they are useful for machine learning.

Page 14: The 10 best platforms to find free datasets and how to tell if they are of good quality

Quality data gives you quality work

Free data is great. High-quality free data is better. If you want to do a great job with the data you find, you need to do your due diligence to make sure it’s good-quality data by asking a few questions.

Page 15: The 10 best platforms to find free datasets and how to tell if they are of good quality

Should I trust the data source?

First, consider the overall reputation of your data source. Ultimately, datasets are created by humans, and those humans may have specific agendas or biases that can carry over into your work.

All of the data sources we have listed here are reliable, but many data sources elsewhere are not. The one caveat for our list is that community-provided collections, such as data.world or GitHub, may vary in quality. If you have doubts about the reputation of your data source, compare it with similar sources on the same topic.

Page 16: The 10 best platforms to find free datasets and how to tell if they are of good quality

Could the data be incorrect?

Next, examine your dataset for inaccuracies. Again, humans create these datasets and humans are not perfect. There may be errors in the data that you can identify and correct with a few quick tips.

First tip: estimate the minimum and maximum you would expect for each of your columns. Then use the filtering and sorting options in your spreadsheet to check whether any values in your dataset fall outside that range.

Let’s say you have a small dataset of used car prices. You would expect the prices to fall somewhere between $7,000 and $20,000 or so. When you sort the price column from low to high, the lowest price probably shouldn’t be very far from $7,000.
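The same range check can be done outside a spreadsheet. The following is a minimal Python sketch, assuming the rows are dictionaries and prices are stored as text such as `$11,000.00`; the function name and the sample rows are made up for illustration.

```python
def out_of_range(rows, column, low, high):
    """Return the rows whose numeric value in `column` falls outside [low, high]."""
    flagged = []
    for row in rows:
        # Strip currency formatting before converting to a number.
        value = float(str(row[column]).replace("$", "").replace(",", ""))
        if not low <= value <= high:
            flagged.append(row)
    return flagged


# Hypothetical sample data for the used-car example.
cars = [
    {"model": "A", "price": "$11,000.00"},
    {"model": "B", "price": "$1,100.00"},  # likely a dropped digit
    {"model": "C", "price": "$18,500.00"},
]
suspicious = out_of_range(cars, "price", 7_000, 20_000)
print([row["model"] for row in suspicious])  # prints ['B']
```

The bounds are only your own estimates, so treat flagged rows as candidates for a closer look rather than as confirmed errors.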

Page 17: The 10 best platforms to find free datasets and how to tell if they are of good quality

But humans can make mistakes and enter data incorrectly: instead of $11,000.00, someone might type $1,100.00 or $11.00.00. Another common problem is that people sometimes don’t want to provide real data for things like phone numbers, so you may see a lot of 9999999999 or 0000000000 entries in those columns.
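One way to spot filler phone numbers like these is to look for entries made of a single repeated digit. The sketch below is a simple heuristic, not a full phone validator; the function name, the minimum length of 7 digits, and the sample numbers are assumptions for illustration.

```python
def is_placeholder(value):
    """Guess whether a phone-like entry is filler, e.g. 9999999999 or 0000000000."""
    digits = [ch for ch in str(value) if ch.isdigit()]
    # A real number is very unlikely to consist of one repeated digit.
    return len(digits) >= 7 and len(set(digits)) == 1


# Hypothetical sample column.
phones = ["7165550142", "9999999999", "000-000-0000", "7165550199"]
flagged = [p for p in phones if is_placeholder(p)]
print(flagged)  # prints ['9999999999', '000-000-0000']
```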

Also, pay attention to the column headings. A field might be titled “% occupied” and the entries might contain 0.80 or 80. Both could mean 80%, but they would be treated as different values in the final dataset.
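A column with mixed conventions like this can be brought onto one scale before analysis. The sketch below assumes that any value above 1 was entered as a whole percentage; that rule is a guess about how the data was entered, so verify it against the dataset's documentation where possible.

```python
def to_fraction(value):
    """Convert a '% occupied' entry to a fraction between 0 and 1.

    Assumption: values above 1 were entered as whole percentages (80 -> 0.80).
    An entry of exactly 1 is ambiguous (1% or 100%) and is left unchanged.
    """
    number = float(value)
    return number / 100 if number > 1 else number


# Hypothetical sample column mixing both conventions.
occupied = [0.80, 80, "0.65", 65]
normalized = [to_fraction(v) for v in occupied]
```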

Then deal with the errors you find. If they are simple and obvious mistakes, correct them. If an entry is clearly wrong and cannot be corrected, remove it from the dataset so that it does not distort your results.

Page 18: The 10 best platforms to find free datasets and how to tell if they are of good quality

Could the data be incomplete?

It is very common for a dataset to have missing data. Before you start working with a dataset, it is a good idea to check for null or missing values. If there are a lot of null values, the dataset is incomplete and may not be good to use.

In Excel, you can do this with the COUNTBLANK function. For example, COUNTBLANK(B1:B3) returns 1 when exactly one of those three cells is empty.

Too many null values probably mean an incomplete dataset. If there are some null values, but not too many, you can go through and replace them with 0 using SQL, or you can do it manually.
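For a CSV file, the same check can be sketched in Python with the standard library. Which strings count as "missing" varies between datasets, so the `MISSING` set below is an assumption, as are the function names and the sample data.

```python
import csv
import io

# Assumption: spellings treated as missing; adjust for your dataset.
MISSING = {"", "null", "none", "na", "n/a", "nan"}


def is_missing(value):
    """Return True when a cell is empty or holds a missing-value marker."""
    return value is None or str(value).strip().lower() in MISSING


def missing_counts(rows):
    """Count missing entries per column in a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + int(is_missing(value))
    return counts


def fill_missing(rows, column, default="0"):
    """Return copies of the rows with missing values in `column` replaced."""
    return [
        {**row, column: default} if is_missing(row[column]) else dict(row)
        for row in rows
    ]


# Hypothetical sample data.
sample = "city,temp\nBuffalo,25\nAlbany,\nSyracuse,NULL\n"
rows = list(csv.DictReader(io.StringIO(sample)))
print(missing_counts(rows))  # prints {'city': 0, 'temp': 2}
```

Keep in mind that replacing missing values with 0 changes sums and averages, so it is only reasonable where 0 is a sensible stand-in for "no value".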

Page 19: The 10 best platforms to find free datasets and how to tell if they are of good quality

How to know if the data is skewed?

Understanding how your dataset is skewed will help you choose the right data to analyze. It’s helpful to use visualizations to see how skewed your dataset is, as it’s not always obvious from just looking at the numbers.

For numeric columns, use a histogram to see what type of distribution each column has (normal, left-skewed, right-skewed, uniform, bimodal, etc.).

There are no strict recommendations for what to do next, since that depends on the dataset, but the way the data is skewed will give you a general idea of its quality and suggest which columns to use in the analysis. You can then use this general idea to avoid misrepresenting the data.
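If you want a quick look without a charting tool, the sketch below computes a skewness score and prints a rough text histogram using only the Python standard library. The function names, the bin count, and the sample prices are assumptions for illustration; a negative score suggests a left-skewed column and a positive score a right-skewed one.

```python
import statistics


def skewness(values):
    """Population skewness: negative = left-skewed, positive = right-skewed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0  # all values identical, so there is no skew to measure
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def text_histogram(values, bins=5):
    """Return one line per equal-width bin, with a '#' per value in the bin."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1  # avoid a zero width for constant data
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return [f"{low + i * width:>10.1f} | {'#' * c}" for i, c in enumerate(counts)]


# Hypothetical sample column: most prices are low, a few are very high.
prices = [7200, 7500, 8100, 8300, 9000, 9400, 11000, 19500]
print(round(skewness(prices), 2))
print("\n".join(text_histogram(prices)))
```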

Page 20: The 10 best platforms to find free datasets and how to tell if they are of good quality

For non-numeric columns, use a frequency table to see how many times each value appears. In particular, check whether one value dominates the column. If so, your analysis may be limited by the low diversity of values. Again, this is just to give you a general idea of the quality of the data and indicate which columns are relevant to use.
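A frequency table takes only a few lines of Python. In the sketch below, the function names, the 90% threshold for calling a value "dominant", and the sample column are assumptions chosen for illustration.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, share) tuples, most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, n, n / total) for value, n in counts.most_common()]


def dominant_value(values, threshold=0.9):
    """Return the most common value if its share reaches `threshold`, else None."""
    table = frequency_table(values)
    if table and table[0][2] >= threshold:
        return table[0][0]
    return None


# Hypothetical sample column with very little variety.
colors = ["red"] * 9 + ["blue"]
for value, count, share in frequency_table(colors):
    print(f"{value:<6} {count:>3} {share:.0%}")
```

Here `dominant_value(colors)` returns `"red"`, a hint that this column offers little variety to analyze.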

You can create these visuals and frequency tables in Excel or Google Sheets from a CSV file, but you might want to turn to a business intelligence (BI) tool for complex datasets.

Page 21: The 10 best platforms to find free datasets and how to tell if they are of good quality

Use free datasets

Once you have your data and are confident in its quality, it’s time to put it to work. You can go a long way with tools like Excel, Google Sheets, and Google Data Studio, but if you are serious about a career in data, you need to be familiar with the real deal: a BI platform.

A BI platform will provide powerful data visualization capabilities for any dataset, from small CSVs to large datasets hosted in data warehouses such as Google BigQuery or Amazon Redshift. You can play around with your data to create dashboards and even collaborate with others.

Page 22: The 10 best platforms to find free datasets and how to tell if they are of good quality

Newsdata_io

Newsdata.io
