Wednesday, 31 December 2014

Important Aspects Of Web Data Scraping

Have you ever heard of "data scraping"? Data scraping is not a new technology, yet many a successful business person has made a fortune by making good use of it.

Website owners are not always happy about the automated harvesting of their data. Webmasters have learned tools and methods to block certain IP addresses and keep web scrapers away from their content, so a scraper working from a single address is ultimately left blocked.

Proxy data scraping technology is the modern solution to this problem: it works by routing requests through proxy IP addresses. Every time your data scraping program performs an extraction from a website, the website thinks the request is coming from a different IP address. To the website owner, proxy data scraping simply looks like a short period of increased traffic from all around the world. They have only very limited and tedious ways of blocking such a script and, more importantly, most of the time they will not even know they are being scraped.
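As a rough illustration of the idea, here is a minimal R sketch that rotates each request through a different proxy using the httr package. The proxy host names, ports and target URL are placeholders, not real services.

# Minimal sketch of rotating proxy requests in R (httr).
# The proxy addresses and the target URL below are placeholders.
library(httr)

proxies <- data.frame(
  host = c("proxy1.example.com", "proxy2.example.com", "proxy3.example.com"),
  port = c(8080, 8080, 3128),
  stringsAsFactors = FALSE
)

fetch_via_proxy <- function(url, i) {
  p <- proxies[((i - 1) %% nrow(proxies)) + 1, ]   # cycle through the proxy list
  resp <- GET(url, use_proxy(p$host, p$port), timeout(10))
  content(resp, as = "text", encoding = "UTF-8")
}

# Each page request appears to come from a different IP address.
pages <- lapply(seq_len(5), function(i) {
  fetch_via_proxy(paste0("http://www.example.com/listing?page=", i), i)
})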

Now you might be asking yourself, "Where can I get proxy data scraping technology for my project?" There is a "do it yourself" solution, but unfortunately it is not trivial. You could consider renting proxy servers from hosting providers, but that option is fairly pricey. It is still definitely better than the alternative: incredibly dangerous (but free) public proxy servers.

But the trick is finding them. Many sites list hundreds of servers, yet identifying one that works, is accessible, and supports the protocol you need takes perseverance, trial and error, and patience. Worse still, you never know who a given server belongs to or what activities are going on elsewhere on it, so sending sensitive requests or data through a public proxy is a bad idea.

A less risky scenario for proxy data scraping is to rent a rotating proxy connection that cycles through a large number of private IP addresses. Companies such as www.webdatascraping.us offer scalable anonymous proxy solutions, but they often come with a fairly hefty setup cost to get you going.

After performing a simple Google search, I was quickly able to find a company offering anonymous proxy server access for data scraping purposes.

Different techniques and processes for collecting and analysing data have developed over time. Web scraping for business has appeared on the market only recently. It is a process that gathers large amounts of data from various sources, such as databases and websites.

It's good to clear the air and let people know that scraping data is a legal process, mainly because the information or data in question is already publicly available on the internet. It is important to understand that this is not a process of stealing information but a process of gathering reliable information, even though some people consider the techniques unsavoury.

So we can define web scraping as a process of collecting data from a variety of websites and databases, achieved either manually or through the use of software. The growth of data mining companies has led to greater use of web extraction and web crawling processes. Another important task of such enterprises is processing and analysing the harvested data, and one of their most valuable traits is that they are experts in this service.

Source:http://www.articlesbase.com/outsourcing-articles/important-aspects-of-web-data-scraping-6160374.html

Wednesday, 24 December 2014

Data Mining Explained

Overview

Data mining is the crucial process of extracting implicit and possibly useful information from data. It uses analytical and visualization techniques to explore and present information in a format which is easily understandable by humans.

Data mining is widely used in a variety of profiling practices, such as fraud detection, marketing research, surveys and scientific discovery.

In this article I will briefly explain some of the fundamentals of data mining and its applications in the real world.

Herein I will not discuss related processes of any sort, including Data Extraction and Data Structuring.

The Effort

Data mining has found applications in various fields such as finance, health care and bio-informatics, business intelligence, social network data research and many more.

Businesses use it to understand consumer behaviour, analyse the buying patterns of clients and expand their marketing efforts. Banks and financial institutions use it to detect credit card fraud by recognising the patterns involved in fake transactions.
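As a toy illustration of pattern-based detection (not any particular bank's method), the following R sketch flags transactions that deviate strongly from a customer's usual spending. The amounts and the 3-standard-deviation threshold are invented for the example.

# Toy sketch: flag transactions far outside a customer's usual spending pattern.
# The amounts and the 3-standard-deviation cut-off are illustrative only.
set.seed(1)
amounts <- c(rnorm(50, mean = 40, sd = 10), 950)   # 50 ordinary purchases plus one planted outlier

z <- (amounts - mean(amounts)) / sd(amounts)       # standardise each amount
suspicious <- which(abs(z) > 3)                    # simple rule-of-thumb threshold

amounts[suspicious]                                # transactions worth a manual review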

The Knack

There is definitely a knack to data mining, as there is with any other field of web research activity. That is why it is referred to as a craft rather than a science. A craft is the skilled practicing of an occupation.

One point I would like to make here is that data mining solutions offer an analytical perspective on a company's performance based on historical data, but one also needs to consider unknown external events and deceitful activities. On the flip side, it is all the more critical for regulatory bodies to forecast such activities in advance and take the necessary measures to prevent such events in future.

In Closing

There are many important niches of web data research that this article has not covered. But I hope it provides you with a starting point to drill down further into this subject, if you want to do so!

Should you have any queries, please feel free to mail me. I would be pleased to answer each of your queries in detail.

Source: http://ezinearticles.com/?Data-Mining-Explained&id=4341782

Monday, 22 December 2014

GScholarXScraper: Hacking the GScholarScraper function with XPath

Kay Cichini recently wrote a word-cloud R function called GScholarScraper on his blog which when given a search string will scrape the associated search results returned by Google Scholar, across pages, and then produce a word-cloud visualisation.

This was of interest to me because around the same time I posted an independent Google Scholar scraper function  get_google_scholar_df() which does a similar job of the scraping part of Kay’s function using XPath (whereas he had used Regular Expressions). My function worked as follows: when given a Google Scholar URL it will extract as much information as it can from each search result on the URL webpage  into different columns of a dataframe structure.

In the comments of his blog post I figured it'd be fun to hack his function to provide an XPath alternative, GScholarXScraper. Essentially it's still the same function he wrote, and therefore full credit should go to Kay on this one as he fully deserves it – I certainly had no previous idea how to make a word cloud, plus I hadn't used the tm package in ages (to the point where I'd forgotten most of it!). The main changes I made were as follows:

•    Restructure the internal code of GScholarScraper into a series of local functions which each do a separate job (this made it easier for me to hack because I understood what was doing what and why).
•    As far as possible, strip out Regular Expressions and replace them with XPath alternatives (made possible via the XML package) – hence the change of name to GScholarXScraper. Basically, apart from a little messing about with the generation of the URLs, I just copied over my get_google_scholar_df() function and removed the Regular Expression alternatives. I'm not saying one is better than the other, but for me personally I find XPath shorter and quicker to code; either is a good approach for web scraping like this (note to self: I really need to learn more about regular expressions!) :) A minimal sketch of the XPath approach follows this list.
•    Vectorise a few of the loops I saw (it surprises me how second nature this has become to me – I used to find the *apply family of functions rather confusing but thankfully not so much any more!).
•    Make use of getURL from the RCurl package (I was getting some multibyte string problems originally when using readLines but this approach automatically fixed it for me).
•    Add an option to make a word-cloud from either the “title” or the “description” fields of the Google Scholar search results.
•    Add stemming via the Rstem package because I couldn’t get the Snowball package to install with my version of Java. This was important to me because I was getting word clouds with variations of the same word in them, e.g. “game”, “games”, “gaming”.
•    Force use of URLencode() on generation of URLs to automatically avoid problems with search terms like “Baldur’s Gate” which would otherwise fail.
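To give a flavour of the XPath approach mentioned above, here is a stripped-down sketch – it is not the actual GScholarXScraper code (that's linked at the end of the post). The XPath expression assumes result titles sit in h3 nodes with class "gs_rt", which may well have changed, and Google Scholar tends to block scripted access anyway.

# Stripped-down sketch of the XPath + word-cloud idea (not the real GScholarXScraper).
# Assumes titles live in <h3 class="gs_rt"> nodes - the markup may have changed.
library(RCurl)     # getURL
library(XML)       # htmlParse, xpathSApply
library(Rstem)     # wordStem (the stemming package mentioned above)
library(wordcloud) # wordcloud

search.str <- "Baldur's Gate"
url <- paste0("http://scholar.google.com/scholar?q=",
              URLencode(search.str, reserved = TRUE))

html   <- getURL(url, .encoding = "UTF-8")
doc    <- htmlParse(html, asText = TRUE, encoding = "UTF-8")
titles <- xpathSApply(doc, "//h3[@class='gs_rt']", xmlValue)

# Tokenise, drop short words, stem, and count frequencies.
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[nchar(words) > 3]
words <- wordStem(words, language = "english")
freq  <- sort(table(words), decreasing = TRUE)

wordcloud(names(freq), as.numeric(freq), min.freq = 2)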

I think that’s pretty much everything I added. Anyway, here’s how it works (link to full code at end of post):

# #EXAMPLE 1: Display word cloud based on the title field of each Google Scholar search result returned
# GScholarXScraper(search.str = "Baldur's Gate", field = "title", write.table = FALSE, stem = TRUE)
#
# # word freq
# # game game 71
# # comput comput 22
# # video video 13
# # learn learn 11
# # [TRUNC...]
# #
# #
# # Number of titles submitted = 210
# #
# # Number of results as retrieved from first webpage = 267
# #
# # Be aware that sometimes titles in Google Scholar outputs are truncated - that is why, i.e., some mandatory intitle-search strings may not be contained in all titles

[word cloud image]

I think that’s kind of cool and corresponds to what I would expect for a search about the legendary Baldur’s Gate computer role-playing game :)  The following is produced if we look at the ‘description’ field instead of the ‘title’ field:

# # EXAMPLE 2: Display word cloud based on the description field of each Google Scholar search result returned
GScholarXScraper(search.str = "Baldur's Gate", field = "description", write.table = FALSE, stem = TRUE)
#
# # word freq
# # page page 147
# # gate gate 132
# # game game 130
# # baldur baldur 129
# # roleplay roleplay 21
# # [TRUNC...]
# #
# # Number of titles submitted = 210
# #
# # Number of results as retrieved from first webpage = 267
# #
# # Be aware that sometimes titles in Google Scholar outputs are truncated - that is why, i.e., some mandatory intitle-search strings may not be contained in all titles

[word cloud image]

Not bad. I could see myself using the text mining and word cloud functionality with other projects I’ve been playing with such as Facebook, Google+, Yahoo search pages, Google search pages, Bing search pages… could be fun!

Many thanks again to Kay for making his code publicly available so that I could play with it and improve my programming skill set.

Code:

Full code for GScholarXScraper can be found here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/GScholarXScraper/GScholarXScraper

Original GScholarScraper code is here: https://docs.google.com/document/d/1w_7niLqTUT0hmLxMfPEB7pGiA6MXoZBy6qPsKsEe_O0/edit?hl=en_US

Full code for just the XPath scraping function is here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/googleScholarXScraper/googleScholarXScraper.R

Source:http://www.r-bloggers.com/gscholarxscraper-hacking-the-gscholarscraper-function-with-xpath/

Thursday, 18 December 2014

Data Extraction - A Guideline to Use Scraping Tools Effectively

Many people around the world do not know much about these scraping tools. In their view, mining means extracting resources from the earth, but in these internet technology days the newly mined resource is data. There are many data mining software tools available on the internet for extracting specific data from the web. Every company in the world deals with tons of data, and managing and converting that data into a useful form is genuinely hectic work. If the right information is not available at the right time, a company loses the chance to make strategic decisions based on accurate information.

This type of situation means missed opportunities in the present competitive market. In these situations, data extraction and data mining tools help you take strategic decisions at the right time and reach your goals in this competitive business. These tools bring many advantages: you can store customer information in an orderly manner, you can learn about the operations of your competitors, and you can measure your own company's performance. It is critical for every company to have this information at its fingertips when it is needed.

To survive in this competitive business world, data extraction and data mining are critical to the operations of the company. A powerful tool called a website scraper is used in online digital mining. With this tool, you can filter data on the internet and retrieve the information you need for specific purposes. This scraping tool is used in various fields and comes in numerous types. Research, surveillance, and the harvesting of direct marketing leads are just a few of the ways a website scraper assists professionals in the workplace.

A screen scraping tool is another tool useful for extracting data from the web. It is most helpful when you work on the internet to mine data onto your local hard disks. It provides a graphical interface allowing you to designate a Uniform Resource Locator, the data elements to be extracted, and the scripting logic to traverse pages and work with the mined data. You can run this tool at periodic intervals and use it to download databases from the internet into your spreadsheets. The most important of the scraping tools is data mining software: it extracts large amounts of information from the web and compiles that data into a useful format. This tool is used in various sectors of business, especially by those who are generating leads, establishing budgets, watching competitors' prices and analysing online trends. With this tool, the information is gathered and immediately put to use for your business needs.
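As a minimal sketch of what such a tool does under the hood, the following R snippet pulls an HTML table from a page and saves it as a spreadsheet-friendly CSV file. The URL and the assumption that the page contains a table are hypothetical.

# Minimal sketch: pull an HTML table from a page and save it as a CSV "spreadsheet".
# The URL below is a placeholder, and the page is assumed to contain at least one <table>.
library(RCurl)  # getURL
library(XML)    # htmlParse, readHTMLTable

url  <- "http://www.example.com/price-list.html"
html <- getURL(url, .encoding = "UTF-8")
doc  <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

tables <- readHTMLTable(doc, stringsAsFactors = FALSE)  # one data frame per <table>
prices <- tables[[1]]                                   # take the first table on the page

write.csv(prices, "prices.csv", row.names = FALSE)      # open in any spreadsheet program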

Another useful scraping tool is an email scraping tool, which crawls public email addresses from various websites. You can easily form a large mailing list with this tool and use it to promote your product online, send proposals or offers for related business, and much more. With this tool, you can find customers interested in your product or potential business partners, which allows you to expand your business in the online market.
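A very rough sketch of the idea in R follows; the URL is a placeholder and the regular expression is only a simple approximation of valid email syntax.

# Rough sketch: harvest email-like strings from a page with a simple regular expression.
# The URL is a placeholder and the pattern only approximates valid addresses.
library(RCurl)  # getURL

html <- getURL("http://www.example.com/contact-us.html", .encoding = "UTF-8")

pattern <- "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"
emails  <- unique(unlist(regmatches(html, gregexpr(pattern, html))))

emails  # character vector of addresses found on the page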

Many well-established and esteemed organizations provide these features free of cost as a trial offer to customers. If you want a permanent service, you need to pay a nominal fee. You can also download these tools from their websites.

Source: http://ezinearticles.com/?Data-Extraction---A-Guideline-to-Use-Scrapping-Tools-Effectively&id=3600918


Tuesday, 16 December 2014

Online Data Entry and Data Mining Services

A data entry job involves transcribing a particular type of data into some other form. It can be either online or offline. The input data may include printed documents like application forms, survey forms, registration forms, handwritten documents, etc.

The data entry process is an inevitable part of the job for any organization; one way or another, every organization demands data entry. The skills required vary depending on the nature of the job: in some cases data must be entered from hard-copy formats, and in other cases it must be entered directly into a web portal. An online data entry job generally requires the data to be entered into an online database.

For a supermarket, a data associate might be required to enter the goods sold on a particular day and the new goods received that day in order to keep the stock well in order. By doing this, the concerned authorities also get an idea of the sales figures for each commodity as they require. In another example, in an office, an account executive might be required to input the day-to-day expenses into the online accounting database in order to keep the accounts in order.

The aim of the data mining process is to collect information from reliable online sources as per the requirements of the customer and convert it into a structured format for further use. The major sources for data mining are internet search engines such as Google, Yahoo, Bing, AOL, MSN, etc. Many search engines, such as Google and Bing, provide customised results based on the user's activity history. Based on our keyword search, the search engine lists the websites from which we can gather the details we need.

Data collected from online sources – such as company name, contact person, company profile, contact phone number and email ID – is typically used for marketing activities. Once the data has been gathered from the online sources into a structured format, the marketing team starts its promotions by calling or emailing the people concerned, which may result in new customers. So data mining plays a vital role in today's business expansion. By outsourcing data entry and its related work, you can save the cost that would otherwise be incurred in setting up the necessary infrastructure and in employee costs.

Source:http://ezinearticles.com/?Online-Data-Entry-and-Data-Mining-Services&id=7713395

Saturday, 13 December 2014

Microfinance Data Scraping

I went to DataKind's New York Datadive last November and met the Microfinance Information Exchange (MIX), a group that ‘delivers data services, analysis, research and business information on the institutions that provide financial services to the world’s poor’. They wanted to see whether web scraping could save them from manually gathering data, so fellow divers and I showed MIX the utility of web scraping. Over the course of a day, about six people scraped data about microfinance institutions from a bunch of websites, saving MIX an estimated year of manual data entry.

Over the past few months, I worked further with MIX to study who has access to what sorts of financial services. DataKind just put up our blog post about the project. Read the post, or just look at the map and explore the data.

Source:https://blog.scraperwiki.com/2012/05/microfinance-data-scraping/

Thursday, 11 December 2014

Ethics in data journalism: mass data gathering – scraping, FOI and deception

Mass data gathering – scraping, FOI, deception and harm

The data journalism practice of ‘scraping’ – getting a computer to capture information from online sources – raises some ethical issues around deception and minimisation of harm. Some scrapers, for example, ‘pretend’ to be a particular web browser, or pace their scraping activity more slowly to avoid detection. But the deception is practised on another computer, not a human – so is it deception at all? And if the ‘victim’ is a computer, is there harm?

The tension here is between the ethics of virtue (“I do not deceive”) and teleological ethics (good or bad impact of actions). A scraper might include a small element of deception, but the act of scraping (as distinct from publishing the resulting information) harms no human. Most journalists can live with that.

The exception is where a scraper makes such excessive demands on a site that it impairs that site’s performance (because it is repetitively requesting so many pages in a small space of time). This not only negatively impacts on the experience of users of the site, but consequently the site’s publishers too (in many cases sites will block sources of heavy demand, breaking the scraper anyway).

Although the harm may be justified against a wider ‘public good’, it is unnecessary: a well designed scraper should not make such excessive demands, nor should it draw attention to itself by doing so. The person writing such a scraper should ensure that it does not run more often than is necessary, or that it runs more slowly to spread the demands on the site being scraped. Notably in this regard, ProPublica’s scraping project Upton “helps you be a good citizen [by avoiding] hitting the site you’re scraping with requests that are unnecessary because you’ve already downloaded a certain page” (Merrill, 2013).
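By way of illustration (this is a generic sketch, not Upton's actual code or any particular project's scraper), a well-behaved scraper in R might identify itself honestly, pause between requests and cache pages it has already downloaded; the contact address and URLs below are placeholders.

# Generic sketch of a "polite" scraper: honest User-Agent, pauses between requests,
# and a local cache so already-downloaded pages are never requested twice.
# The contact address and target URLs are placeholders.
library(httr)

polite_get <- function(url, cache_dir = "cache", delay = 5) {
  dir.create(cache_dir, showWarnings = FALSE)
  cache_file <- file.path(cache_dir, paste0(gsub("[^A-Za-z0-9]", "_", url), ".html"))

  if (file.exists(cache_file)) {          # reuse the cached copy instead of re-requesting
    return(paste(readLines(cache_file, warn = FALSE), collapse = "\n"))
  }

  Sys.sleep(delay)                        # spread the load on the site being scraped
  resp <- GET(url, user_agent("example-research-scraper (contact: you@example.com)"))
  page <- content(resp, as = "text", encoding = "UTF-8")
  writeLines(page, cache_file)
  page
}

pages <- lapply(c("http://www.example.com/a", "http://www.example.com/b"), polite_get)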

Attempts to minimise that load can themselves generate ethical concerns. The creator of seminal data journalism projects chicagocrime.org and Everyblock, Adrian Holovaty, addresses some of these in his series on ‘Sane data updates’ and urges being upfront about

    “which parts of the data might be out of date, how often it’s updated, which bits of the data are updated … and any other peculiarities about your process … Any application that repurposes data from another source has an obligation to explain how it gets the data … The more transparent you are about it, the better.” (Holovaty, 2013)

Publishing scraped data in full does raise legal issues around the copyright and database rights surrounding that information. The journalist should decide whether the story can be told accurately without publishing the full data.

Issues raised by scraping can also be applied to analogous methods using simple email technology, such as the mass-generation of Freedom of Information requests. Sending the same FOI request to dozens or hundreds of authorities results in a significant pressure on, and cost to, public authorities, so the public interest of the question must justify that, rather than its value as a story alone. Journalists must also check the information is not accessible through other means before embarking on a mass-email.

Source: http://onlinejournalismblog.com/2013/09/18/ethics-in-data-journalism-mass-data-gathering-scraping-foi-and-deception/