Knowledge Discovery and Data-mining

From IAE-Pedia




This is a Work in Progress. It is a long way from completion. Readers are encouraged to contribute to this page.

People interested in the KDD topic may also enjoy reading:


Introduction

Here is a quote from a short article by Sid Perkins in the 8/14/2010 issue of Science News:

Researchers poring over Google Earth images have discovered one of the planet's freshest impact craters—a 45-meter-wide pock in southwestern Egypt that probably was excavated by a fast-moving meteorite no more than a few thousand years ago.

This is an excellent example of data mining. The Google company provides a search engine named Google. It also provides at no cost a variety of software and a lot of Web content. Google Earth contains a large collection of views of the earth provided by Google.

People using Google Earth have different purposes in mind. Many of these purposes will be ones that Google did not think of when it was creating this database. The researchers mentioned in the quote given above had a particular interest and searched the database looking for images that met their needs. That is, they did research by data mining.

Well, what's the big deal? Users of the Web do this all of the time. A Web user poses a problem or question, translates it into search terms, and uses a Web search engine to seek information. The Web is a huge collection of data, information, and knowledge organized into Web pages.

The knowledge and skills of the search engine are combined with the knowledge and skills of the person using the search engine in an attempt to find data that fits the needs of the searcher. The artificially intelligent computer search engine and the human's intelligence are combined in an attempt to find the desired data. This is an excellent example of Two Brains Are Better Than One.

Some History

The early development of electronic digital computers focused on developing machines that could accurately and rapidly carry out mathematical calculations. Scientists needed machines to help them do the calculations that provided solutions to the equations they used to describe their mathematical models. The military needed to compute artillery firing tables.

As electronic digital computers became more readily available, their uses in government, business, and industry became evident. Data processing emerged as a dominant use of computers. Data was collected, stored, and processed. Colleges and even some high schools began to offer courses in Business Data Processing.

Eventually the ideas shown in the following figure began to become clearer.

Data 5-part.jpeg

Computers could certainly be used for data processing, but they could also be used for information processing. Existing and emerging computer science departments often reflected this in their choice of title: Computer and Information Science.

In still more recent times, Knowledge Discovery and Data-mining (KDD) emerged as an important subdiscipline of Computer and Information Science. It is now clear that computers are very useful throughout the first four points in the figure given above.

As progress in artificial intelligence and in cognitive neuroscience continues, we are inching our way into developing computer programs that have foresight.

Like all "progress" in technology and in development of tools, there are both good and bad aspects. Some people and groups of people are empowered more than others, and some lose power. This has been going on for a very long time. Think about the early printing presses used when Gutenberg developed metal, movable type. Think about the cotton gin and about other farm machinery used for plowing and harvesting. Automation of physical and mental activities changes the way people work (and play).

Deluged in Data and Information

You know that a large amount of information about you and other people has been collected and is stored in a combination of hard copy and digitized forms. In addition, huge amounts of information are being gathered about other aspects of our world, solar system, galaxy, and universe. As an example, the Large Hadron Collider collects a lot of data. Quoting from http://lhcb-public.web.cern.ch/lhcb-public/en/Data%20Collection/Triggers-en.html:

"After filtering by the first level trigger, a very large amount of data still remains. 35 gigabytes - equivalent to 8 DVDs worth of information - is fed every second into 2,000 state-of-the-art computers, located deep underground at the LHCb site."

The following article helps provide insight into how much data is now being collected each year.

Collett, Stacy (8/23/2010). Five indispensable IT skills of the future. Computerworld. Retrieved 8/27/2010 from http://www.computerworld.com/s/article/350908/5_Indispensable_IT_Skills_of_the_Future. Quoting from the article:

In the year 2020, technical expertise will no longer be the sole province of the IT department. Employees throughout the organization will understand how to use technology to do their jobs.

Yet futurists and IT experts say that the most sought-after IT-related skills will be those that involve the ability to mine overwhelming amounts of data, protect systems from security threats, manage the risks of growing complexity in new systems, and communicate how technology can increase productivity.

By 2020, the amount of data generated each year will reach 35 zettabytes, or about 35 million petabytes, according to market researcher IDC. That's enough data to fill a stack of DVDs reaching from the Earth to the moon and back, according to John Gantz, chief research officer at IDC.

A petabyte is 10^15 bytes. A zettabyte is 10^21 bytes. A medium-length novel is about 10^6 bytes. The holdings of the US Library of Congress are less than a petabyte. So the prediction is that ten years from now, we will be collecting data at a rate of more than 35 million Libraries of Congress per year.
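The unit arithmetic behind this comparison can be checked with a few lines of Python. This is only a sketch: it uses the decimal unit sizes and the one-megabyte novel given in the text, and the variable names are made up for illustration.

```python
# Decimal (SI) byte units, as defined in the text.
PETABYTE = 10**15
ZETTABYTE = 10**21
NOVEL = 10**6  # rough size of a medium-length novel, per the text

# IDC's projection quoted above: 35 zettabytes generated per year.
yearly_data = 35 * ZETTABYTE

# Integer division is exact here because every quantity is a power of ten.
petabytes_per_year = yearly_data // PETABYTE
novels_per_year = yearly_data // NOVEL

print(f"{petabytes_per_year:,} petabytes per year")
print(f"{novels_per_year:,} novel-sized files per year")
```

Because the Library of Congress is described as holding less than one petabyte, the petabyte count is a lower bound on the number of "Libraries of Congress" per year.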

Big Brother is Watching You

The very large collections of data can be used to help answer questions and solve problems that people are interested in. Suppose, for example, that you are the tax collector in a city where many people have outdoor swimming pools. Some of the people got the necessary building permits, safety and building inspections, and so on when they built or had their pools built. Others did not. You make use of Google Earth to look at detailed aerial views of the tax district. You visually identify the outdoor pools, and you look up each of the properties on the tax records and building permit records. You succeed in identifying a number of pools that don't have building permits and that are not on the tax records.
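Once the pools have been identified by eye, the cross-check against the two record sets amounts to a set difference. The sketch below assumes each property has a parcel ID; the IDs and the function name are hypothetical and are not drawn from any real tax system.

```python
# Hypothetical parcel IDs, invented for illustration only.
pools_seen_in_imagery = {"P-101", "P-102", "P-205", "P-310"}
parcels_with_pool_permit = {"P-101", "P-205"}
parcels_taxed_for_pool = {"P-101", "P-102"}


def unrecorded_pools(seen, permits, taxed):
    """Return parcels with a visible pool that appear in neither record set."""
    return sorted(seen - permits - taxed)


# Pools with no permit AND no tax record.
flagged = unrecorded_pools(
    pools_seen_in_imagery, parcels_with_pool_permit, parcels_taxed_for_pool
)

# Pools with no permit, whether or not they are taxed.
missing_permit = sorted(pools_seen_in_imagery - parcels_with_pool_permit)

print(flagged)
print(missing_permit)
```

The hard part in practice is the first step, recognizing pools in aerial images; the record matching itself is simple once both data sets share a common key.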

Hmm. Is this legal? Is it an invasion of privacy? Is it fair for "Big Brother" to use such technology?

And what about having computerized traffic cameras that can read a car's license plate, send information to a centrally located computer, have the computer look up the traffic record (and other legal information such as wants and warrants) on the person listed as the car's owner, and then direct a patrol car to the scene if the computer system "thinks" this would be appropriate?

To carry this example one step further, we now have computerized systems that are fairly good at identifying a person from a TV picture of the person. Such a system is useful in a variety of Homeland Security activities. More and more, you can believe that "Big Brother" is watching you and is making use of very powerful computerized database systems.

Different Types of Databases

There are many different types of databases. A grocery shopping list or a "to do today" list are simple examples of personal databases. You might write these on pieces of paper or you might put them into an electronic digital storage device.

Back when reading and writing were first being developed a little over 5,000 years ago, clay or stone tablets were used instead of paper or personal digital assistants (PDAs). But then, as now, the goal was to make use of a memory aid that would persist (last) over time and that could be shared with others. The early impetus was the value of such aids in business and government. Many more important uses have been developed over the past 5,000 years.

Changing needs and changing technology have led to the development of many different types of database structures. See http://www.theukwebdesigncompany.com/articles/types-of-databases.php.

Flat, Fixed Fields Database

The simplest type of database is a table with one or more columns. For example, suppose that you are going to invite a number of people to a party. You make a list of the names of the people you want to invite. That is a database.

Next you add a second column of data. This column contains a telephone number for each person. Now each row of your database consists of a pair of items: a person's name and telephone number.

But what if a person has more than one telephone number or does not have a telephone number? You might want to add a third column for a person's (possible) second phone number and a fourth column for a person's email address.
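The four-column party list can be sketched as a flat table in Python. This is an illustration under stated assumptions: the guest names, the column names, and the choice of an empty string to mark a missing value are all made up here.

```python
import csv
import io

# Fixed set of columns: every row has exactly these four fields.
FIELDS = ["name", "phone_1", "phone_2", "email"]

# Made-up guests. An empty string marks a field with no value.
guests = [
    {"name": "Ana", "phone_1": "555-0101", "phone_2": "555-0102",
     "email": "ana@example.com"},
    {"name": "Ben", "phone_1": "555-0103", "phone_2": "", "email": ""},
    {"name": "Cy", "phone_1": "", "phone_2": "", "email": "cy@example.com"},
]


def to_flat_file(rows):
    """Serialize the rows as comma-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def guests_without_phone(rows):
    """Names of guests for whom both phone columns are empty."""
    return [r["name"] for r in rows if not r["phone_1"] and not r["phone_2"]]


flat_text = to_flat_file(guests)
print(flat_text)
print(guests_without_phone(guests))
```

The fixed columns keep every row the same shape, at the cost of empty cells for people who have fewer phone numbers and a hard limit for people who have more than two.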

Flat file database: See http://en.wikipedia.org/wiki/Flat_file_database. Quoting from the document:

The first uses of computing machines were implementations of simple databases. Herman Hollerith conceived the idea that census data could be represented by holes punched in paper cards and tabulated by machine. He sold his concept to the US Census Bureau; thus, the Census of 1890 was the first ever computerized database—consisting, in essence, of thousands of boxes full of punched cards.
Strictly, a flat file database should consist of nothing but data and, if records vary in length, delimiters. More broadly, the term refers to any database which exists in a single file in the form of rows and columns, with no relationships or links between records and fields except the table structure.
… the basic terms "record" and "field" are used in nearly every flat file database implementation.

In a flat file database, items in a field bear some relationship to each other. For example, one field might be the last name of employees, a second field the first names of the same employees, and a third field their hourly rate of pay, and the next seven fields an employee's hours worked on M, Tu, W, Th, F, Sat, and Sunday of a particular work week. It is this semantic organization that makes such databases so powerful and relatively easy to use.
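As a rough sketch of that ten-field employee record, the Python below parses one comma-delimited line per employee and computes weekly pay. The sample names, rates, and hours are invented, and the comma delimiter is an assumption; a real payroll file could use a different layout.

```python
from dataclasses import dataclass


@dataclass
class TimesheetRecord:
    last_name: str
    first_name: str
    hourly_rate: float
    hours: list  # seven values, Monday through Sunday


def parse_record(line):
    """Split one delimited line into its ten fields, in the order described."""
    fields = line.split(",")
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    return TimesheetRecord(
        last_name=fields[0],
        first_name=fields[1],
        hourly_rate=float(fields[2]),
        hours=[float(h) for h in fields[3:10]],
    )


def weekly_pay(record):
    """Hourly rate times total hours for the week (no overtime rules)."""
    return record.hourly_rate * sum(record.hours)


# Invented sample data: last name, first name, rate, then M..Sun hours.
sample_lines = [
    "Lopez,Maria,20.0,8,8,8,8,8,0,0",
    "Chen,Wei,15.5,4,4,4,4,4,6,0",
]

for line in sample_lines:
    record = parse_record(line)
    print(record.last_name, weekly_pay(record))
```

The code only works because each position in the line has an agreed meaning, which is the semantic organization the paragraph above refers to.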

Linked Database

It may be that you already have a comprehensive database of friends, acquaintances, repair and service people you have hired in the past, and so on. If so, the work of creating your party database could be simplified by merely listing the names of people you want to invite along with a link to their names in your master database.
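One way to picture such a link is a party list that stores only keys into the master database. The sketch below is hypothetical: the contact entries, the integer keys, and the function name are invented for illustration.

```python
# Hypothetical master database, keyed by an arbitrary contact ID.
master_contacts = {
    1: {"name": "Ana", "phone": "555-0101", "category": "friend"},
    2: {"name": "Ben", "phone": "555-0103", "category": "plumber"},
    3: {"name": "Cy", "phone": "555-0104", "category": "friend"},
}

# The party database holds only links (keys), not copies of the details.
party_list = [1, 3]


def resolve(links, master):
    """Follow each link to the full record in the master database."""
    return [master[key] for key in links]


# An update made once in the master database...
master_contacts[3]["phone"] = "555-0199"

# ...is visible through the party list without editing the party list.
invited = resolve(party_list, master_contacts)
for person in invited:
    print(person["name"], person["phone"])
```

Storing links instead of copies means each person's details live in one place, so a changed phone number never has to be corrected in two lists.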


This is conveniently done with computerized databases. One merely makes each name in the party list an active link (a clickable link) to

(Work in progress.) Describe fixed field database with the data collected and organized to help answer predetermined questions. A hard copy or electronic telephone directory is a good example. A company's employee database and customer database are other examples.

A physical (hard copy) library is a type of database. At the first level, the "data" are the documents. A user's goal is to find (retrieve) a particular document. As libraries grew in size, people developed the idea of a card catalog. One or more cards correspond to each document. The cards are arranged alphabetically in drawers. A user can search by author name …

There are many general types of databases. See http://en.wikipedia.org/wiki/Database.

As an example, see Semantic Database at http://semanticdb.blogspot.com. Quoting:

Semantic Database realizes the Semantic Web vision of Sir Tim Berner Lee. This space will hold the thoughts and ideas that comes to me while talking to different people who are interested in this topic. Semantic DB is an attempt to create a Database(knowledge base) where each data elements are related to every other elements based on meaning. I am Still working on it. If you are reading this then please feel free to write your comment on the posts you are reading.

KDD Student Contest

Worcester Polytechnic Institute (2010). At D.C. Conference, WPI PhD Student Presents Research on How Students Learn at Different Rates. Retrieved 8/2/2010 from http://www.wpi.edu/news/20101/2010zach.html. Quoting from the article:

Worcester Polytechnic Institute (WPI) PhD student Zach Pardos placed second among student teams and fourth place overall out of more than 600 teams, in the 2010 Knowledge Discovery and Data-mining (KDD) Cup – a two-month-long, high-profile annual data mining competition run by the Association of Computing Machinery (ACM). His performance at the KDD Cup allowed him the chance to present his research findings this week in Washington, D.C.
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information, and, for his achievement at the KDD Cup, Pardos received $3,000 in prize money and also travel funds to attend the July 25-28 Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) Conference in the nation's capital. At the conference, Pardos, a native of Denver, Colo., had several opportunities to present his research, which is under consideration to be published in an upcoming issue of the Journal of Machine Learning Research.

Online Games

Barras, Colin (8/23/2010). Online games are a gold mine for design ideas. NewScientist. Retrieved 8/26/2010 from http://www.newscientist.com/article/mg20727745.100-online-games-are-a-gold-mine-for-design-ideas.html. Quoting from the article:

GONE are the days when video gaming was a private pursuit. Gaming services such as Microsoft's Xbox Live not only connect players in living rooms the world over, they can also record every move each gamer makes. Academic researchers are learning to use information mined from this mountain of data to build more stimulating games - and commercial games designers are beginning to take notice.
"All of the big games publishers are getting into data mining," says Julian Togelius of the Center for Games Research at the IT University of Copenhagen, Denmark. "They're talking to universities, even hiring researchers to work on some of these huge data sets."

Weber has used this approach to create a robot player called EISBot. He downloaded thousands of replays, and used machine-learning algorithms to identify patterns in the data that helped predict how games would unfold. That knowledge was then encoded into EISBot. After only a few minutes of game play, EISBot can predict an opponent's strategy with 70 per cent accuracy at least 2 minutes before it is executed - an advantage in a real-time game.

References

University of Missouri News Bureau (11/1/2010). Researchers Expand Cyberspace to Fight Chronic Condition in Breast Cancer Survivors, retrieved 11/5/2010 from http://munews.missouri.edu/news-releases/2010/1101-researchers-expand-cyberspace-to-fight-chronic-condition-in-breast-cancer-survivors/. Quoting from the article:

COLUMBIA, Mo. – Lymphedema is a chronic condition that causes swelling of the limbs and affects physical, mental and social health. It commonly occurs in breast cancer survivors and is the second-most dreaded effect of treatment, after cancer recurrence. Every day, researchers throughout the world learn more about the condition and how it can be treated. Now, University of Missouri researchers are developing a place in cyberspace where relevant and timely information can be easily stored, searched, and reviewed from anywhere with the goal of improving health care through the availability of up-to-date, evidence-based research.
“We want to bring researchers, medical professionals and care providers together to improve patients’ health,” said Chi-Ren Shyu, principal investigator for the project and director of the MU Informatics Institute. “Merging all of the data into one virtual space and discovering clinically significant knowledge from the haystacks of data will make cutting-edge research and treatments available to patients sooner.”
Currently, people looking for information about lymphedema treatment have to visit dozens of medical websites or consult a best practices document, which has not been updated since 2006. The new system will enable immediate access to data, best practices, literature and research from around the world as it is posted online, all in a single, searchable online database.
“The cyber-infrastructure, once complete, can be applied to other diseases and chronic conditions, such as diabetes or cardiovascular disease,” Shyu said. “Potential users include researchers, medical professionals, social workers, patients and their families.”

Author or Co-authors

The initial version of this page was created by David Moursund.
