I want to share with you a web-enabled community spell-checker dictionary idea I had this morning.
The technical details may bore you, but the simple description should help generate interest. How many times have you written a word that you knew was spelled correctly, like “blog”, and had your spell-checker tell you it’s wrong? What do you do? You can either ignore the “error” and leave it there, with the squiggly red line under it, or you can add the word to your user dictionary (and in a lot of cases, like new Internet words, that means hoping you didn’t spell it wrong).
The user dictionary, I’m sure you know, is simply a database of words stored on the local machine (your computer) that is compared against each word you type in that program. A separate user dictionary is generated by each individual spell-checking program you use (the word processor, the web browser, etc.), and there are no safeguards in place to prevent you from adding a wrong word to one of your user dictionaries. Have you ever tried to go in and remove an incorrect word from a user dictionary? I have. It’s not any fun. Plus, adding a word to one user dictionary doesn’t add it to another one, so if you frequently use a new word, you’ll end up being told it’s spelled wrong by all of your spell-checkers until you add it to all of their dictionaries.
What if you could download one program that would check all of your spelling in every program and website? What if that program were linked to other computers running the same spell-checker so it could collect data on misspelled and unknown words from a large number of people and figure out which words belong in the dictionary and which ones really are just spelled wrong?
Such a program could easily exist with current technologies, but as far as I know it remains only an idea in my head. Just imagine a dictionary that keeps itself up to date with all of the newest, correctly spelled vocabulary! A service could be offered to export the master dictionary to other spell-checkers’ native formats as an on-line download for people who don’t want to use the actual spell-checker program but want updated and accurate dictionaries to check their spelling. Periodically, fun statistics could be generated and shared via RSS, including the most frequently misspelled words, most popular words of the day, and a yearly list of new vocabulary generated by progress and technology.
Here is a more technical description of how the program could work:
First, the main functions, in order, would be:
1. Track and monitor all spelling in all programs on a user’s computer.
2. Use that data to calculate a score for the user, giving more weight to situations where more people use correct spelling and less weight to situations where less is expected (like chats).
3. On-line, the program allows users to endorse words that are not in the dictionary but frequently marked as correct. The higher the user’s score is, the more weight his/her endorsement will have.
4. Words endorsed by enough trusted users are incorporated into the dictionary.
Locally, the dictionary integrates with all programs and tracks user spelling habits, counting each time he/she misspells a word found in the dictionary (the user writes the word, the dictionary says it’s wrong, and the user corrects the spelling) and each time he/she writes a word not found in the dictionary (the user writes the word, the dictionary generates suggestions, and the user selects the option to ignore the misspelling). The particular words misspelled, spelled correctly and unknown to the dictionary are stored in a database on-line.
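To make the tracking step a little more concrete, here is a minimal sketch in Python of the per-user tallies described above. Everything in it is my own assumption for illustration: the class name SpellingLog, its method names, and the choice to fold words to lowercase are all hypothetical, and a real program would also need hooks into each application plus a way to sync the counts to the on-line database.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SpellingLog:
    """Per-user tallies of spelling events (hypothetical structure)."""

    words_written: int = 0
    # Known words the user misspelled and then corrected.
    corrected: Counter = field(default_factory=Counter)
    # Words missing from the dictionary that the user chose to ignore.
    ignored: Counter = field(default_factory=Counter)

    def record_word(self) -> None:
        """Count one word typed, whether or not it was flagged."""
        self.words_written += 1

    def record_correction(self, intended_word: str) -> None:
        """The dictionary flagged a word and the user fixed the spelling."""
        self.corrected[intended_word.lower()] += 1

    def record_ignore(self, unknown_word: str) -> None:
        """The dictionary flagged a word and the user chose to ignore it."""
        self.ignored[unknown_word.lower()] += 1
```

The two counters are kept separate on purpose: corrections say something about the user's proficiency, while ignored words are the candidates for new dictionary entries.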
The program then calculates certain statistics for the user based on these numbers. For example, a user may have an overall spelling accuracy of 70% and frequently misspell the same 15 words, even though those repeated misspellings represent only 3% of everything the user writes.
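As a rough illustration of these statistics, the sketch below computes an overall accuracy and the share of a user's writing taken up by his/her most frequently misspelled words. The function name spelling_stats, its parameters, and the returned keys are assumptions of mine, not part of any existing program.

```python
from collections import Counter


def spelling_stats(words_written: int, misspellings: Counter, top_n: int = 15) -> dict:
    """Summarize a user's spelling from raw counts (hypothetical helper).

    misspellings maps each intended word to how often it was misspelled.
    """
    if words_written <= 0:
        raise ValueError("words_written must be positive")

    total_errors = sum(misspellings.values())
    # The words this user gets wrong most often, with their counts.
    worst = misspellings.most_common(top_n)
    worst_errors = sum(count for _, count in worst)

    return {
        "accuracy": 1 - total_errors / words_written,
        "repeat_offenders": [word for word, _ in worst],
        "repeat_share": worst_errors / words_written,
    }
```

For example, calling spelling_stats(1000, Counter({"definitely": 20, "separate": 10, "rhythm": 5}), top_n=2) would report "definitely" and "separate" as the repeat offenders, and those two words alone would account for 3% of the 1,000 words written.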
The program also tracks where and in what situations the user is using correct spelling, punctuation and structure (capitalization, etc.), giving less weight to chat sessions with poor structural performance, greater weight to e-mail writing, and the greatest weight to blog entries, Wikipedia articles, local word processing, Google Docs, etc. If a high percentage of people use correct punctuation and spelling in a specific program or at a specific website, the program knows to give a higher weight to the performance of other users in the same situation. If the spell-checking program is unsure of a situation (there is little data about a program’s weight, for example), the weight of the situation is calculated based on the length of the written material. This data would be stored on-line and be incorporated into a central algorithm for calculating a user’s spelling proficiency.
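One possible way to turn this weighting into numbers is sketched below. It is only a guess at how the central algorithm might look: the names ContextSample, context_weight and proficiency are hypothetical, and the thresholds (1,000 community words before a situation's data is trusted, 500 words as a "full-length" text) are arbitrary placeholders. The weight of a situation is simply the community's accuracy there, with the length-based fallback used when there is too little data.

```python
from dataclasses import dataclass

MIN_COMMUNITY_WORDS = 1000  # assumed: below this, the situation is "unknown"
FULL_LENGTH_WORDS = 500     # assumed: texts this long get the maximum fallback weight


@dataclass
class ContextSample:
    """One user's writing in one situation (hypothetical structure)."""
    context: str  # e.g. "chat", "email", "blog"
    words: int
    errors: int


def context_weight(context: str, community: dict[str, tuple[int, int]], text_length: int) -> float:
    """Weight a situation by how carefully the community writes there.

    community maps a context name to (total_words, total_errors).
    """
    words, errors = community.get(context, (0, 0))
    if words >= MIN_COMMUNITY_WORDS:
        return 1 - errors / words
    # Too little data: fall back on the length of the written material.
    return min(text_length / FULL_LENGTH_WORDS, 1.0)


def proficiency(samples: list[ContextSample], community: dict[str, tuple[int, int]]) -> float:
    """Weighted average of the user's accuracy across situations."""
    weighted_sum = 0.0
    weight_total = 0.0
    for sample in samples:
        if sample.words == 0:
            continue
        weight = context_weight(sample.context, community, sample.words)
        weighted_sum += weight * (1 - sample.errors / sample.words)
        weight_total += weight
    return weighted_sum / weight_total if weight_total else 0.0
```

Using community accuracy directly as the weight is the simplest choice I could think of. It may not separate chats from blog entries strongly enough in practice, so a real algorithm would probably need to stretch or rescale these weights.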
Words frequently not found in the dictionary but deemed by users to be correct (i.e., modern terms) can be reviewed by users with a low frequency of misspellings of known vocabulary, a high number of words written per day, and a high total proficiency score. Words endorsed by enough trusted users are then automatically added to the central user dictionary database. Admittedly, this is a difficult calculation. How much endorsement would be needed, and what percentage of the endorsements would need to be from users with a high score? What constitutes a high score? An algorithm would need to be developed that would permit words to be added to the dictionary without too much delay, but not without first receiving enough endorsement to ensure the word is proper.
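I don't have answers to those questions, but the shape of such a rule is easy to sketch. In the Python below, the function name should_add_word and all three threshold values are placeholders I made up; the real numbers would have to come from experimenting with actual data. Each endorsement counts for as much as the endorser's proficiency score (between 0 and 1), and a word is accepted only when the total weight is high enough and enough of the endorsers are themselves high scorers.

```python
def should_add_word(
    endorser_scores: list[float],
    min_total_weight: float = 5.0,   # assumed: how much endorsement is "enough"
    high_score: float = 0.9,         # assumed: what counts as a high score
    min_trusted_share: float = 0.5,  # assumed: required share of high scorers
) -> bool:
    """Decide whether an endorsed word joins the dictionary (hypothetical rule).

    endorser_scores holds the proficiency score of each user who endorsed the word.
    """
    if not endorser_scores:
        return False

    total_weight = sum(endorser_scores)
    trusted = sum(1 for score in endorser_scores if score >= high_score)
    trusted_share = trusted / len(endorser_scores)

    return total_weight >= min_total_weight and trusted_share >= min_trusted_share
```

With these placeholder values, six endorsers who each score 0.95 would get a word accepted, while a dozen endorsers who each score 0.5 would not, even though their combined weight is higher.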
I believe that a dictionary maintained by such an algorithm would be invaluable to society. Even current institutions such as dictionary publishers could benefit from such data being collected. The idea could be applied to dictionaries in other languages. This idea represents the movement of dictionary maintenance techniques from the 20th century into the 21st-century era of community efforts and social data.
I think it’s the next logical step. What do you think?
As a final note, I was doing some searching and digging around to see if anyone else has done this or written about it, and I stumbled upon a great way to handle the dictionary database. I also found that programs to check spelling in any application already exist, but I found no mention of a community-enabled program collecting data via the Internet to extend the dictionary rather than trusting the user when he/she decides to “add to dictionary.”