Category Archives: Data Mining
The WordPress.com stats helper monkeys prepared a 2012 annual report for this blog.
Here’s an excerpt:
4,329 films were submitted to the 2012 Cannes Film Festival. This blog had 18,000 views in 2012. If each view were a film, this blog would power 4 Film Festivals.
Google recently came out with a new feature that sorts results by reading level. For example, if you were an elementary or middle school child looking up information for a school report, you would perform an advanced search and select a basic reading level. There are a few options for refining your results to a specific reading level (see screenshot below).
This is one of the first general internet search engines to offer this function (at least that I’m aware of). Many subscription-based databases offer similar tiered reading level search and refining capabilities, such as EBSCOHost’s Middle Search Plus, which uses the Lexile reading level system to rate the grade level of literature.
It would be interesting to find out how Google categorizes pages into reading levels. Does it use a controlled vocabulary? If specific words appear a designated number of times or in a particular sequence, would the page be considered advanced? Or perhaps if there is a lack of advanced terminology, such as the scientific names used for classification, would the page be considered basic or intermediate?
And could this somehow be used as a filter? For example, perhaps searches for explicit language or sexual content at a basic reading level would not yield results. While this would be very helpful for parents trying to screen objectionable material from their children, it might create a hurdle for adults who are studying a language (perhaps doing a report on a health topic in a second language). This is not the best example, since adults tend to be more sophisticated searchers and this should not be much of a barrier for them, but you get my point.
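I don’t know how Google actually assigns reading levels, but one well-known way to estimate them is a readability formula. Here is a minimal Python sketch using the Flesch-Kincaid grade level formula. The syllable counter is a rough vowel-group heuristic, and the cutoffs for basic, intermediate, and advanced (`basic_max`, `intermediate_max`) are my own assumptions for illustration, not anything Google has published.

```python
import re


def count_syllables(word):
    """Rough estimate: count runs of consecutive vowels (a heuristic, not a dictionary lookup)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Return the Flesch-Kincaid grade level of the text, or 0.0 if it has no words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def reading_level(text, basic_max=6.0, intermediate_max=12.0):
    """Bucket a grade level into three labels. The cutoffs are assumed, not Google's."""
    grade = flesch_kincaid_grade(text)
    if grade <= basic_max:
        return "basic"
    if grade <= intermediate_max:
        return "intermediate"
    return "advanced"


simple = "The cat sat on the mat. The dog ran to the park."
dense = ("Phylogenetic classification necessitates comprehensive "
         "morphological examination of taxonomically ambiguous specimens.")
print(reading_level(simple), reading_level(dense))
```

Short sentences built from short words score low, while long sentences full of multi-syllable terminology score high, which fits the guess above that scientific vocabulary might push a page toward the advanced level.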
This is a wonderful resource for those interested only in a specific reading level range, such as scientists who want the technical information, or school-aged children who might want to exclude the really technical literature. Below are directions from the Google Support Page that explain how to use this new feature.
Features: Reading level
Sometimes you may want to limit your search results to a specific reading level. For instance, a junior high school teacher looking for content for her students or a second-language learner might want web pages written at a basic reading level. A scientist searching for the latest findings from the experts may want to limit results to those at advanced reading levels.
To limit your search results to a specific reading level, follow these steps:
- On the search results page, click Advanced Search below the search box.
- Next to “Reading level” within the “Need more tools” section, select your desired reading level (basic, intermediate, or advanced) or choose to show all results annotated with reading levels.
- Click Advanced search at the bottom of the page.
- At any time, you can click the X in the right corner of the blue bar beneath the search box to go back to seeing all results.
I ran across this interesting bit of info just the other day. Google recently purchased Metaweb, a San Francisco-based semantic search company, because it “contains information on more than 12 million web ‘entities,’ from people to scientific theories.”
In other words, Google just bought a bunch of metadata. Metadata is basically descriptive information about something, such as the color of someone’s hair, their height, weight, etc. This purchase may signal that Google will soon add extra value to individual Internet resources and web sites. Ultimately, this means that your search results may become more accurate and relevant. And if Google steps up to the semantic web plate, will other search engines like BING do the same?
Here is a link for further reading at NewScientist.com, which explains the details and what it could mean to you in the future.
ALA released the new “cataloging” standards, known as RDA or Resource Description and Access, earlier this week. I have a feeling that life will become a lot more interesting for librarians and patrons alike because of this change. Why does this matter to the average Joe?
Libraries have been trying to incorporate online resources into the traditional library catalog since the new technologies arrived on the scene. However, from the beginning these new technologies have defied the traditional library catalog classification system. They simply don’t fit the traditional “book” metadata format (metadata is descriptive information about a specific resource).
Catalogers eventually came to terms with this phenomenon (some earlier than others), and they began forming a new set of standards for organizing information, or “cataloging,” to enhance and eventually replace the current AACR2 cataloging standards.
The new standards will be geared to pull in metadata (e.g. title of a resource, description of it, and access) from online resources and make it much more useful and dynamic for the user. Hopefully, the future online catalog will be able to monitor changes to web resources and automatically update the changes by itself.
How useful would it be to have a master catalog of all the resources that libraries in the United States (and even the world) own and provide access to? Instead of each library listing what materials it has, the national catalog would list the available resources with FRBRized metadata, and each library could simply check a box to indicate whether it has an item and where it is located. Not only would this help library patrons conceptualize what a library has to offer in terms of unique holdings and access, but it could also help vendors identify potential markets.
This is jumping the gun a little, but it is fun to dream! Below is the official news release from the American Libraries Magazine.
For Immediate Release
Tue, 07/13/2010 – 09:22
Contact: Jill Davis
CHICAGO - ALA Editions, the publishing imprint of the American Library Association, announces the release of “Introducing RDA: A Guide to the Basics,” by Chris Oliver. Resource Description and Access (RDA) is the new cataloguing standard that will replace the Anglo-American Cataloguing Rules (AACR). The 2010 release of RDA is not the release of a revised standard; it represents a shift in the understanding of the cataloguing process. Oliver, cataloguing and authorities coordinator at the McGill University Library and chair of the Canadian Committee on Cataloguing, offers practical advice on how to make the transition. This indispensable Special Report helps catalogers by:
- concisely explaining RDA and its expected benefits for users and cataloguers, presented through topics and questions;
- placing RDA in context by examining its connection with its predecessor, AACR2, as well as looking at RDA’s relationship to internationally accepted principles, standards and models; and
- detailing how RDA positions us to take advantage of newly emerging database structures, how RDA data enables improved resource discovery and how we can get metadata out of library silos and make it more accessible.
Oliver has worked at the McGill University Library since 1989, as a cataloguing librarian and cataloguing manager. She received her M.A. and M.L.I.S. degrees from McGill University. She is the chair of the Canadian Committee on Cataloguing and has been a member of the committee since 1997. This has given her the opportunity to be involved with the evolution of RDA from its beginning. She served as a member of the Joint Steering Committee’s Format Variation Working Group and as chair of the RDA Outreach Group. She has given presentations on RDA in Canada, the United States and internationally.
ALA Store purchases fund advocacy, awareness, and accreditation programs for library professionals worldwide. ALA Editions publishes resources used worldwide by tens of thousands of library and information professionals to improve programs, build on best practices, develop leadership, and for personal professional development. ALA authors and developers are leaders in their fields, and their content is published in a growing range of print and electronic formats. Contact ALA Editions at (800) 545-2433, ext. 5418, or email@example.com.
Read original source: http://www.americanlibrariesmagazine.org/news/ala/guide-rda-basics
For sports whose winners are determined by time (the fastest participant wins), here is a fun example of how data can be represented both audibly and visually. The New York Times put together this little interactive “Olympic Musical” that shows just how close together the top athletes finish.
This resource does an excellent job of showing how raw data can be represented in new ways. It also shows how diversely information can be interpreted, disseminated, and used. The screenshot below shows the Olympic results for Alpine skiing (this gives only a taste of what the resource does and how raw data can be represented). The Men’s Downhill results are in mid-play; the yellow dots mark where the piano notes hit, within milliseconds of each other.
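To give a feel for the idea of turning finish times into sound, here is a tiny Python sketch. It only computes how long after the winner each note would play; it is my own illustration, not how the New York Times built its interactive, and the skier names and times are made up.

```python
def note_offsets(finish_times):
    """Return (name, seconds behind the winner) pairs, fastest first.

    Each offset is the delay at which that athlete's note would sound
    if the winner's note plays at time zero.
    """
    if not finish_times:
        return []
    ordered = sorted(finish_times.items(), key=lambda item: item[1])
    winner_time = ordered[0][1]
    return [(name, round(t - winner_time, 2)) for name, t in ordered]


# Hypothetical finish times in seconds (not real Olympic results).
times = {"Skier B": 114.38, "Skier A": 114.31, "Skier C": 114.40}
for name, delay in note_offsets(times):
    print(f"{name}: note plays {delay:.2f} s after the winner")
```

Played back in real time, gaps this small sound almost like a single chord, which is exactly the point the interactive makes.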
Go to NY Times article…
Here is an interesting article that talks about how many current catalogs have lousy algorithms, and don’t do an adequate job of finding information within an individual library’s catalog.
After Losing Users in Catalogs, Libraries Find Better Search Software
September 28, 2009
By Marc Parry
Thomas Jefferson founded the University of Virginia. So you might think that typing his name into Virgo, Virginia’s online library catalog, would start you off with a book about him.
Jean A. Bauer tried it the other night. At the top of the results list were papers from a physics conference in Brazil.
The problem is that traditional online library catalogs don’t tend to order search results by ranked relevance, and they can befuddle users with clunky interfaces. Bauer, a graduate student specializing in early American history, once had such a hard time finding materials that she titled a bibliography “Meager Fruits of an Ongoing Fight With Virgo.”
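To illustrate what ordering results by ranked relevance means, here is a minimal Python sketch. It is not the software the article describes; it just counts how often the query words appear in each record and gives extra weight to matches in the title. The records and the `title_weight` value are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words and numbers."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(record, query, title_weight=3.0):
    """Count query-term matches, weighting title matches more heavily (assumed weight)."""
    terms = tokenize(query)
    title = tokenize(record["title"])
    description = tokenize(record.get("description", ""))
    return sum(title_weight * title.count(t) + description.count(t) for t in terms)


def ranked_search(records, query):
    """Return the titles of matching records, best match first."""
    scored = [(score(r, query), r) for r in records]
    hits = [(s, r) for s, r in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [r["title"] for _, r in hits]


# Made-up catalog records, for illustration only.
catalog = [
    {"title": "Proceedings of a Physics Conference (made-up record)",
     "description": "Papers presented at Thomas Jefferson Hall"},
    {"title": "A Biography of Thomas Jefferson (made-up record)",
     "description": "The life of Thomas Jefferson"},
    {"title": "Gardening Basics (made-up record)",
     "description": "Soil and seeds"},
]
print(ranked_search(catalog, "Thomas Jefferson"))
```

With this scoring, the biography comes out ahead of the physics proceedings because the name appears in its title, not just in its description. A catalog that ignores relevance could just as easily list the proceedings first.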
You’ve probably seen the work of Wordle without knowing what it is or does. From a campus meeting that took place yesterday, I discovered Wordle.net. It groups and aggregates words that are displayed within a web page. For words that are used several times within a web page, the size of the word grows to emphasize its use. Words that are large are used frequently, while small words are used infrequently.
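The basic idea of sizing words by how often they appear can be sketched in a few lines of Python. This is only my guess at the general approach, not Wordle’s actual algorithm; the stop-word list, the font size range, and the `top_n` cutoff are all assumptions.

```python
import re
from collections import Counter

# A tiny assumed stop-word list; a real tool would use a much longer one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "for", "on", "with"}


def word_sizes(text, min_size=10, max_size=48, top_n=10):
    """Map the most frequent words to font sizes scaled between min_size and max_size."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    highest = counts[0][1]
    lowest = counts[-1][1]
    span = highest - lowest
    sizes = {}
    for word, n in counts:
        # If every word occurs equally often, show them all at the largest size.
        scale = (n - lowest) / span if span else 1.0
        sizes[word] = round(min_size + scale * (max_size - min_size))
    return sizes


print(word_sizes("Google Google Google books books library"))
```

The most frequent word gets the largest size and the least frequent of the top words gets the smallest, with everything else scaled in between.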
I tested Wordle.net with this blog (Library Shop Talk) to see how it compares with the purpose of this blog, and came up with the following. (This is only the home page with ten of the latest posts). Google is the most frequently occurring word. I’m not too surprised because I write a lot about Google. It was interesting to see the other words that were highlighted/emphasized: Mobile, books, publishers, source, librarians, copyright, available, read library, settlement, school libraries, groups.
Keep in mind that this is a snapshot in time, and the words will likely be very different next week. However, to make it interesting, I checked the entire blog, expecting even greater variation (see below). The results seem identical, so I wonder if it wasn’t picking up everything in the blog, and just going with the ten latest posts again. It doesn’t even pick up technology, which has dozens of posts. At any rate, it is pretty fun and it is a useful analytical tool!
Here is a fascinating article by the NY Times on how browsers are saving personal information about you and delivering more customized advertisements.
Ads Follow Web Users, and Get More Personal
“Hello, this is Joe your personalized marketer. Since I know you so well, your preferences, price range, buying habits, I want to mention this great deal from your favorite store that you’ll definitely want to check out. They have a HUGE discount on all of your regular purchases. It is unbelievable! …”
… For decades, data companies like Experian and Acxiom have compiled reams of information on every American: Acxiom estimates it has 1,500 pieces of data on every American, based on information from warranty cards, bridal and birth registries, magazine subscriptions, public records and even dog registrations with the American Kennel Club.
A new twist for Microsoft’s BING: you (the general public) can teach the BING search engine to become more efficient, and have fun doing it, through the Page Hunt game. This is the line being touted by marketers, at any rate. You get to play a fun game for free, and the research lab obtains useful data about search strategies (without having to pay its subjects).
Why develop a game in the first place? Many web pages and sites are hard for technology to index, or to gather metadata for (that is, information that helps describe and locate a web page). However, one way to make better search algorithms is to watch how humans interpret web pages, and to note what words or phrases they might use to search for them. Page Hunt is designed to do just that.
Players have three minutes to figure out unique words that would locate a given web page. The goal is to find key words that would make the web page come up first in the result list. The more times you guess the words “correctly,” the higher the score. A high score is desirable. Other examples of human computation include the text and image puzzles that prevent spam, and games that provide metadata for music and images. Look at my previous post about correcting images during account registration.
Hongkiat.com posted a list of 100 alternative search engines. The advantage of using one of these niche-specific search engines is that it zeroes in on the specific content you want, and excludes much of the irrelevant and excessive info you don’t need. Below are the major areas in which these resources specialize.
– Ebook & PDF Search Engines
(Highlights in this genre include Comic Seeker, Free Ebooks, and Google Books)
– Audio & Music Search Engines
(Highlights in this genre include BeeMP3, Find Sounds, and SkreemR–I like the names of these!)
– Video Search Engines
(Highlights in this genre include Hello Movies, which searches NetFlix, Hulu & more simultaneously, Blinkx, and ClipBlast)
– RapidShare Search Engines (file sharing)
(Highlights in this genre include Rapid Share, Files Pump, and RS Finder).