Plagiarism Today: Scraping Starts from the Very First Post

  • Randy Charles Morin · 2 years ago
    Are you sure the scraping was for splogging? If you ping weblogs.com with a new blog, then many RSS-based search engines will pick it up and begin reporting 1 subscriber immediately. It's not that they are incorrectly reporting 1 subscriber, but rather that they don't report subscribers to FeedBurner, so FeedBurner assumes 1 subscriber.
  • WillMacc · 2 years ago
    Moving the blog and everything that came after wasn't really meant to show who was scraping, but it did provide some confirmation that a lot of those still visiting the site and the site feed are more than likely bots that habitually visit the site/feed daily.
    I understand that people don't regularly check RSS readers - as in Google Feed Fetcher - but I would dare to guess that's only a small percentage of the visits.
    Moving the blog also gave me a chance to see who/what visits the site, whereas with my WordPress.com blog I could only see hits and not the actual visit information.
    Since the move, probably 10 crawlers have been shut down from scraping the content of the blog, BUT scraping a blog like mine isn't that big of a deal, since it's the information on the blog that's important. So, if my content gets scraped and ends up on another blog - fine; the information is still valid and people still get to see who's doing what with what and whom.. :)

    Thanks,
    WillMacc
  • WillMacc · 2 years ago
    Also... :)
    A lot of "pseudo" feeders are attached to, and monitor, other ping services.
    I've seen countless visits from known crawlers with "bad intentions" hit the site as soon as a ping is sent out.
    If you have a blog hosted on your own domain, you can issue a ping (to only one service - say, pingomatic) and then sit back and watch who starts hitting the site.
    You'll quickly see a boatload of crawlers come, and a lot of them will not appear as crawlers, but as regular user-agents. If you follow the trends of the crawlers/visitors after a ping, you'll probably start noticing that some visitors will not pull any graphics on the blog, or will only pull one hit, whereas most visitors will have line upon line of various content, items, and graphics that are embedded in the blog themes and within the articles.
    Those that do that are usually bots and not legit users, but having said that, you'll have to be careful and pick out the RSS readers from the bots and crawlers.

    Thanks,
    WillMacc
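The check WillMacc describes above (visitors that request a page but none of the graphics or other embedded items) can be sketched roughly as follows. This is only an illustration under stated assumptions: the server is assumed to write Apache/Nginx "combined" format access logs, and the `suspected_bots` function, the `ASSET_SUFFIXES` list, and the sample log lines are all hypothetical, not anything taken from the thread.

```python
import re
from collections import defaultdict

# Assumption: "combined" access log format, e.g.
# 1.2.3.4 - - [01/Jan/2007:00:00:01 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Assumption: these extensions stand in for "graphics and other embedded items".
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".css", ".js", ".ico")


def suspected_bots(lines):
    """Return sorted (ip, user_agent) pairs that fetched pages but no assets."""
    pages = defaultdict(int)
    assets = defaultdict(int)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not look like combined-format entries
        key = (match.group("ip"), match.group("agent"))
        path = match.group("path").split("?", 1)[0].lower()
        if path.endswith(ASSET_SUFFIXES):
            assets[key] += 1
        else:
            pages[key] += 1
    # A client with page hits and zero asset hits is only a candidate, not proof.
    return sorted(key for key in pages if assets[key] == 0)


# Hypothetical sample: the first client loads the stylesheet, the second does not.
sample = [
    '10.0.0.1 - - [01/Jan/2007:00:00:01 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [01/Jan/2007:00:00:02 +0000] "GET /theme/style.css HTTP/1.1" 200 128 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [01/Jan/2007:00:00:03 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
candidates = suspected_bots(sample)
```

As the comment itself cautions, legitimate RSS readers also fetch the feed without any graphics, so anything flagged this way would still need to be checked by hand before being treated as a scraper.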
  • JB · 2 years ago
    Morin,

    As I said in the article, I'm not sure. There are two things that do disturb me. The first is that WordPress could not identify most of the feed readers. I would think that it would recognize one from an obvious source such as Weblogs.

    Second, those it DID identify were listed as "Web Browsers" and there should not have been any human subscribers to the feed (I didn't even subscribe). Many scrapers hide their bots by having them identify themselves as Web browsers; it is a well-known trick.

    I would say that about 80% of the subscribers were listed as "unknown" and the rest were Web Browsers. I wish I had taken a screenshot of that as well but I was in a rush due to the move. I might reignite the experiment later today and see what happens.

    WillMacc,

    Thanks for providing further confirmation of my theory. If you have any statistics on that, I would love to see them. Perhaps we should work together and put together a more thorough study? This was just quick and dirty to get a feel for the problem.

    Obviously more research needs to be done as the problem is greater than even I imagined...
  • Elf's DH · 2 years ago
    I have done several searches for the scraper sites but have had no luck in locating them.

    Not all scraping is done for splogs. I've gotten (and I'm sure everyone else has gotten) spam emails that have scraped sentences from random websites in order not to be filtered out as gibberish by Bayesian filters. (A particularly amusing one I got reconstituted the descriptions of birds from the Audubon Society).
  • JB · 2 years ago
    Elf's DH,

    I'd heard of that but had not seen an actual case of it taking place. Sure I've gotten the spam with the text in it, but I've never seen my own work used in that way.

    Sadly though, you may be very right. If that's the case, the odds of me finding this text are slim to absolutely none.
  • engtech @ internet duct tape · 2 years ago
    I think you're misinterpreting your data source. The wordpress.com feed stats always follow the ebb and flow of your posting frequency.

    I have a popular wordpress.com blog, and my feed readers are split between the wordpress.com feed (http://engtech.wordpress.com/feed or http://internetducttape.com/feed) and the FeedBurner feed (http://feeds.feedburner.com/engtech). Wordpress.com doesn't let me redirect to my feedburner feed.

    About 624 of my readers are in FeedBurner, and there are another 400-700 who grab the feed directly.

    Here are screenshots of my stats from wordpress.com and from FeedBurner. As you can see, there are serious discrepancies. I trust the FeedBurner stats much more.

    http://i115.photobucket.com/albums/n296/engtech...
    http://i115.photobucket.com/albums/n296/engtech...
    http://i115.photobucket.com/albums/n296/engtech...


    To make it worse, the wordpress.com stats seem to be pretty dumb in that they count feed reader hits even if it's just someone clicking on your link from another feed. Not an issue for this experiment, but something to note.

    Bottom line: no conclusions can be drawn from using wordpress.com feed stats. Set up a blog somewhere that lets you use FeedBurner stats and you'll have a *much* better data sample.

    Interesting idea, but the data you're basing it off of is so questionable to start with.