Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program.
Description
Normally, data transfer between programs is accomplished using data structures suited for automated processing by computers, not people. Such interchange formats and protocols are typically rigidly structured, well-documented, easily parsed, and keep ambiguity to a minimum. Very often, these transmissions are not human-readable at all.
Thus, the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end user, rather than as an input to another program, and is therefore usually neither documented nor structured for convenient parsing. Data scraping often involves ignoring binary data (usually images or multimedia data), display formatting, redundant labels, superfluous commentary, and other information which is either irrelevant or hinders automated processing.
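To make the contrast concrete, the short Python sketch below (with invented data) reads the same value first from a structured JSON interchange message and then from a line of display output formatted for a human reader, where a regular expression has to dig it out of the surrounding labels and punctuation.

    import json
    import re

    # Structured interchange format: unambiguous and trivially parsed.
    message = '{"account": "12-345", "balance": 1024.50, "currency": "USD"}'
    balance = json.loads(message)["balance"]

    # Human-readable display output: the same fact buried in labels,
    # padding, and formatting intended for a person, not a parser.
    display_line = "Account 12-345 ............ Balance:  $1,024.50  (USD)"
    match = re.search(r"Balance:\s*\$([\d,]+\.\d{2})", display_line)
    scraped_balance = float(match.group(1).replace(",", ""))

    assert balance == scraped_balance  # both recover 1024.5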
Data scraping is most often done either to interface to a legacy system that has no other mechanism compatible with current hardware, or to interface to a third-party system that does not provide a more convenient API. In the second case, the operator of the third-party system will often see screen scraping as unwanted, for reasons such as increased system load, the loss of advertisement revenue, or the loss of control over the information content.
Data scraping is generally considered an ad hoc, inelegant technique, often used only as a "last resort" when no other mechanism for data interchange is available. Aside from the higher programming and processing overhead, output displays intended for human consumption often change structure frequently. Humans can cope with this easily, but a computer program may report nonsense, having been told to read data in a particular format or from a particular place, with no knowledge of how to check its results for validity.
Technical variants
Screen scraping
Screen scraping is normally associated with the programmatic collection of visual data from a source, instead of parsing data as in web scraping. Originally, screen scraping referred to the practice of reading text data from a computer display terminal's screen. This was generally done by reading the terminal's memory through its auxiliary port, or by connecting the terminal output port of one computer system to an input port on another. The term screen scraping is also commonly used to refer to the two-way exchange of data. This can be the simple case where the controlling program navigates through the user interface, or a more complex scenario where the controlling program enters data into an interface meant to be used by a human.
As a concrete example of a classic screen scraper, consider a hypothetical legacy system dating from the 1960s - the dawn of computerized data processing. Computer-to-user interfaces of that era were often simply text-based dumb terminals which were not much more than virtual teleprinters (such systems are still in use today, for various reasons). The desire to interface such a system to a more modern system is common. A robust solution will often require things no longer available, such as source code, system documentation, APIs, or programmers with experience in a 50-year-old computer system. In such cases, the only feasible solution may be to write a screen scraper that "pretends" to be a user at a terminal. The screen scraper might connect to the legacy system via Telnet, emulate the keystrokes needed to navigate the old user interface, process the resulting display output, extract the desired data, and pass it on to the modern system. (A sophisticated and resilient implementation of this kind, built on a platform providing the governance and control required by a major enterprise - e.g. change control, security, user management, data protection, operational audit, load balancing, queue management, etc. - could be said to be an example of robotic process automation software.)
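A minimal Python sketch of such a scraper is shown below. The host name, login prompts, menu keystrokes, and screen layout are all invented for the illustration, and it relies on the standard-library telnetlib module (removed in Python 3.13), so it is a sketch of the approach rather than a ready-made implementation.

    import telnetlib

    HOST = "legacy.example.com"   # hypothetical legacy host for the example

    def scrape_customer_balance(customer_id: str) -> str:
        """Pretend to be a terminal user: log in, navigate the menu,
        and pull one field out of the resulting screen text."""
        tn = telnetlib.Telnet(HOST, 23, timeout=10)
        try:
            tn.read_until(b"LOGIN:", timeout=5)
            tn.write(b"opsuser\r\n")
            tn.read_until(b"PASSWORD:", timeout=5)
            tn.write(b"secret\r\n")

            # Keystrokes the old menu-driven UI expects (invented here).
            tn.read_until(b"MAIN MENU", timeout=5)
            tn.write(b"3\r\n")                        # option 3: account enquiry
            tn.read_until(b"CUSTOMER ID:", timeout=5)
            tn.write(customer_id.encode("ascii") + b"\r\n")

            screen = tn.read_until(b"PF3=EXIT", timeout=5).decode("ascii", "replace")

            # Assume the desired value appears on the line labelled "BALANCE:".
            for line in screen.splitlines():
                if "BALANCE:" in line:
                    return line.split("BALANCE:")[1].strip()
            raise ValueError("balance field not found on screen")
        finally:
            tn.close()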
In the 1980s, financial data providers such as Reuters, Telerate, and Quotron displayed data in 24×80 format intended for a human reader. Users of this data, particularly investment banks, wrote applications to capture and convert this character data as numeric data for inclusion in calculations for trading decisions without re-keying the data. The common term for this practice, especially in the United Kingdom, was page shredding, since the results could be imagined to have passed through a paper shredder. Internally Reuters used the term 'logicized' for this conversion process, running a sophisticated computer system on VAX/VMS called the Logicizer.
More modern screen scraping techniques include capturing the bitmap data from the screen and running it through an OCR engine, or, in the case of some specialised automated testing systems, matching the screen's bitmap data against expected results. This can be combined, for GUI applications, with querying the graphical controls by programmatically obtaining references to their underlying programming objects. A sequence of screens is automatically captured and converted into a database.
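A minimal sketch of the OCR variant might look like the following, assuming the third-party Pillow and pytesseract packages (and a local Tesseract installation) are available on a platform where Pillow's ImageGrab is supported; the captured screen region is an arbitrary example.

    from PIL import ImageGrab      # Pillow: captures the screen as a bitmap
    import pytesseract             # wrapper around the Tesseract OCR engine

    def scrape_screen_region(bbox=(100, 200, 700, 260)) -> str:
        """Capture a rectangle of the screen and run OCR on it.
        The bounding box is an assumed location of the field of interest."""
        image = ImageGrab.grab(bbox=bbox)          # (left, top, right, bottom)
        text = pytesseract.image_to_string(image)  # bitmap -> recognized text
        return text.strip()

    if __name__ == "__main__":
        print(scrape_screen_region())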
Another modern adaptation of these techniques is to use, instead of a sequence of screens as input, a set of images or PDF files, so there is some overlap with generic "document scraping" and report mining techniques.
Web scraping
Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end users and not for ease of automated use. Because of this, tool kits that scrape web content have been created. A web scraper is an API or tool to extract data from a website. Companies like Amazon AWS and Google provide web scraping tools, services, and public data available free of cost to end users. Newer forms of web scraping involve listening to data feeds from web servers. For example, JSON is commonly used as a transport storage mechanism between the client and the web server.
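As a simple illustration, the following sketch downloads a page and extracts values from its HTML, assuming the third-party requests and beautifulsoup4 libraries; the URL and the markup structure it selects on are invented for the example.

    import requests
    from bs4 import BeautifulSoup   # third-party package: beautifulsoup4

    URL = "https://example.com/products"   # placeholder URL for the example

    def scrape_product_names(url: str = URL) -> list[str]:
        """Download a page built for human readers and extract the text of
        elements that (in this invented markup) carry the product name."""
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        return [tag.get_text(strip=True)
                for tag in soup.select("h2.product-name")]

    if __name__ == "__main__":
        for name in scrape_product_names():
            print(name)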
Recently, companies have developed web scraping systems relying on techniques in DOM parsing, computer vision, and natural language processing to simulate the human processing that occurs when viewing a webpage, in order to extract useful information automatically.
Large websites usually use defensive algorithms to protect their data from web scrapers and to limit the number of requests an IP or IP network may send. This has caused an ongoing battle between website developers and scraping developers.
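On the scraping side, a common way to stay within such limits is to throttle requests and honour the site's robots.txt. A minimal, standard-library-only sketch (with an invented target site and an assumed polite delay) might look like this:

    import time
    import urllib.robotparser
    import urllib.request

    BASE = "https://example.com"          # placeholder site for the example
    DELAY_SECONDS = 2.0                   # assumed polite delay between requests

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(BASE + "/robots.txt")
    robots.read()

    def polite_fetch(path: str) -> bytes | None:
        """Fetch a page only if robots.txt allows it, then pause so the
        request rate stays well under typical per-IP limits."""
        url = BASE + path
        if not robots.can_fetch("*", url):
            return None                   # disallowed for generic crawlers
        with urllib.request.urlopen(url, timeout=10) as resp:
            data = resp.read()
        time.sleep(DELAY_SECONDS)         # simple fixed-delay throttling
        return data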
Report mining
Report mining is the extraction of data from human-readable computer reports. Conventional data extraction requires a connection to a working source system, suitable connectivity standards or an API, and usually complex querying. By using the source system's standard reporting options, and directing the output to a spool file instead of to a printer, static reports can be generated that are suitable for offline analysis via report mining. This approach can avoid intensive CPU usage during business hours, can minimise end-user licence costs for ERP customers, and can offer very rapid prototyping and development of custom reports. Whereas data scraping and web scraping involve interacting with dynamic output, report mining involves extracting data from files in a human-readable format, such as HTML, PDF, or text. These can be easily generated from almost any system by intercepting the data feed to a printer. This approach can provide a quick and simple route to obtaining the data without the need to program an API into the source system.
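The following sketch illustrates the idea on an invented fragment of a fixed-width print spool file: each detail line of the report is sliced at known column positions to recover the underlying records, using only the Python standard library.

    # An invented fragment of a print spool file, with fixed-width detail
    # lines such as a legacy report generator might send to the printer.
    REPORT = (
        "ACME CORP          DAILY SALES REPORT           PAGE 1\n"
        "ITEM              QTY   UNIT PRICE    TOTAL\n"
        "Widget A          3        19.99    59.97\n"
        "Widget B         10         2.50    25.00\n"
    )

    def mine_report(text: str) -> list[dict]:
        """Recover records by slicing each detail line at known column positions."""
        records = []
        for line in text.splitlines():
            qty_field = line[14:20].strip()
            if not qty_field.isdigit():      # skip page headers and column captions
                continue
            records.append({
                "item": line[0:14].strip(),
                "qty": int(qty_field),
                "unit_price": float(line[20:33].strip()),
                "total": float(line[33:].strip()),
            })
        return records

    if __name__ == "__main__":
        for record in mine_report(REPORT):
            print(record)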
See also
- Comparison of feed aggregators
- Data cleaning
- Data munging
- Importer (computing)
- Information extraction
- Open data
- Mashup (web application hybrid)
- Metadata
- Web scraping
- Search engine scraping
Further reading
- Hemenway, Kevin and Calishain, Tara. Spidering Hacks. Cambridge, Massachusetts: O'Reilly, 2003. ISBN 0-596-00577-6.