Data cleaning is often one of the more frustrating aspects of a data-intensive project. If you have ever had to import a dataset of any real size, you will know what I mean. The first project where I needed to clean data from a CSV dump containing tens of millions of rows was a particularly good learning experience. After innumerable parsing errors while trying to import the file, I realized that the data contained all the common delimiters (commas, tabs, colons, pipes, spaces, etc.). I was inexperienced enough that I had not known to use custom delimiters, so all my tools were failing to import the data. It was also here that I learned about the unsigned 16-bit integer limit of 65,535 records in many applications. I also learned about the practical limit of roughly 3.5 gigabytes of addressable memory on most 32-bit machines, which restricted the amount of data I could load at one time.
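To illustrate the custom-delimiter point, here is a minimal Python sketch using the standard `csv` module. It assumes the export can be produced with a character that never appears in the data; the ASCII unit separator (0x1F) and the sample rows are hypothetical choices for illustration, not details from the original project.

```python
import csv
import io

def parse_delimited(text, delimiter="\x1f"):
    """Parse delimited text into a list of rows using an explicit delimiter."""
    # newline="" lets the csv module handle line endings itself.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return [row for row in reader]

# Hypothetical sample: the fields contain commas, pipes, colons, and tabs,
# so the fields are separated by the ASCII unit separator instead.
sample = "name\x1fnote\nSmith, Jane\x1fa|b:c\td\n"
rows = parse_delimited(sample)
```

Because the delimiter is stated explicitly, the commas, pipes, and tabs inside the fields are treated as ordinary data and do not split the row.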
I have come to appreciate the need for data cleaning and de-duplication tools with user interfaces that are accessible to non-programmers, but that can also handle larger datasets without crashing or making a modern machine grind to a halt. Microsoft Excel is the standard multi-tool for corporations. It is not ideal for many tasks, but it is more than capable of getting the job done. Fields such as email addresses are difficult to validate in Excel in more than a cursory way. The problem is compounded by the fact that many web forms do not adequately validate input, which can leave datasets with substantial numbers of user input errors. I wrote about this issue in
Validating Email Address in Web Forms—The Hazards of Complexity
It is possible to validate large numbers of addresses by attempting to connect to the mail server for each address, but this is inefficient since it means people end up testing a great many addresses that were simply errors in data collection or improper exporting, etc. It would be far more efficient to analyze the data and remove or fix the clearly invalid addresses before submitting them for testing.
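As a rough sketch of that kind of pre-filtering, the Python below separates addresses that pass a cheap syntax check from those that clearly do not. The pattern is an assumption chosen for illustration (exactly one "@", no whitespace, at least one dot in the domain) and is deliberately loose; it is not a complete RFC 5322 validator, and the function name is hypothetical.

```python
import re

# Deliberately loose pattern: one "@", no whitespace, and a dotted domain
# with no empty labels. An illustrative assumption, not a full validator.
_PLAUSIBLE = re.compile(r"^[^@\s]+@[^@\s.]+(\.[^@\s.]+)+$")

def split_plausible(addresses):
    """Separate addresses that pass a cheap syntax check from those that fail."""
    keep, reject = [], []
    for raw in addresses:
        candidate = raw.strip()  # stray whitespace is a common export artifact
        if _PLAUSIBLE.match(candidate):
            keep.append(candidate)
        else:
            reject.append(raw)  # keep the raw value for later inspection
    return keep, reject
```

Only the `keep` list would then need to be submitted for the slower mail-server test, while the `reject` list can be reviewed or repaired by hand.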
Excel has some built-in data cleaning functions, but they are limited. Recent versions of Excel on Windows (Excel 2007 and later) are no longer restricted to the older row limit and can handle up to 1,048,576 rows per worksheet.
ASAP Utilities for Excel on Windows is not dedicated to data cleaning, but it has a number of cleaning features as well as many other handy functions to search, replace, normalize, sort, add and remove formatting. The tool is quite fast, which I appreciate. ASAP Utilities is free for non-commercial use and $49 for commercial use. I recommend it.
WinPure Clean & Match, WinPure ListCleaner Pro, and WinPure ListCleaner Lite are straightforward data cleaning and normalization tools that are primarily targeted at large mailing lists. The tools include features that will be compelling for large list owners, such as statistics on missing fields and automated email address validation. Prices range from $225 for the version aimed at smaller list-cleaning operations (under 100,000 records) to $1,199 for the version that handles up to 500,000 records in one go. These tools are particularly attractive if you would like to allow non-programmers to clean larger mailing list datasets.
R10Clean offers a good number of tools to perform standard data cleaning. While it does not have any facilities specifically for email lists, the application does make it straightforward to find duplicate rows and empty strings, and to remove extraneous text. The main web page is sparse, which makes it difficult to get a good idea of the feature set. The R10Clean Manual (pdf) offers a better overview. The application is cross-platform and works on Windows, Mac OS X, and Linux. It costs about 50 pounds (around $80).
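For readers who want to see what these standard operations amount to, here is a minimal Python sketch of duplicate-row removal and empty-field detection. It is a generic illustration under my own assumptions (rows as lists of strings, duplicates compared after trimming whitespace) and says nothing about how R10Clean works internally.

```python
def clean_rows(rows):
    """Drop duplicate rows (compared after trimming) and list rows with empty fields."""
    seen = set()
    unique, has_empty = [], []
    for row in rows:
        # Trim surrounding whitespace so "a " and "a" count as the same value.
        trimmed = tuple(field.strip() for field in row)
        if trimmed in seen:
            continue  # a later copy of a row we already kept
        seen.add(trimmed)
        unique.append(list(trimmed))
        if any(field == "" for field in trimmed):
            has_empty.append(list(trimmed))
    return unique, has_empty
```

The first occurrence of each row is kept, and rows with empty fields are reported rather than deleted so that someone can decide how to handle them.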
XTabulator for Mac OS X is a nice, simple solution for manipulating tabular data. The feature set is fairly basic, but it makes it possible to quickly modify column-based data, and it is useful for a first pass at cleaning up mailing lists. For example, the application makes it simple to rapidly move columns around and perform basic autofill. Exporting to a new output format is also simple. In addition to the standard delimited formats, the application will export to XML, an HTML table, or a SQLite 3 database. XTabulator is $20.