More and more websites are being built on Syrinx Community Server (SCS) lately, and an issue we keep seeing is how to get content from a client's old site into the new site built with SCS. Some of these are forums with tens of thousands of posts. Rather than writing code that works directly against some other product's database, such as when moving a forum from phpBB to SCS, we wrote a configurable utility feature that browses the client's current site, grabs the content from its web pages, and populates the new site's normalized database.
With this we can read all of a client's blog posts directly from their blog site and create articles from them in their new SCS site. We can create contacts, articles, products, forum posts, blogs and photo albums from content contained in other sites. It will even download all the images found in content being converted and fix the image tags in the new site to use the images from the local site.
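To make the image handling concrete, here is a minimal Python sketch of the idea. It is not the SCS code; the function name `localize_images` and the `/media/imported/` prefix are made up for illustration. It finds each image tag, works out the absolute URL of the remote image, and rewrites the tag to point at a local path, returning the list of downloads that would still need to be performed.

```python
import posixpath
import re
from urllib.parse import urljoin, urlparse

# Captures the text before the src value, the value itself, and the closing quote.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=["\'])([^"\']+)(["\'])', re.IGNORECASE)


def localize_images(html, page_url, media_prefix="/media/imported/"):
    """Rewrite <img> tags to local paths.

    Returns (new_html, downloads), where downloads is a list of
    (remote_url, local_path) pairs. media_prefix is a hypothetical
    location for the new site's media library.
    """
    downloads = []

    def swap(match):
        # Resolve relative image paths against the page they were found on.
        remote = urljoin(page_url, match.group(2))
        name = posixpath.basename(urlparse(remote).path) or "image"
        local = media_prefix + name
        downloads.append((remote, local))
        return match.group(1) + local + match.group(3)

    return IMG_SRC.sub(swap, html), downloads
```

A real converter would also need to handle file name collisions and actually fetch each remote URL; this sketch only covers the tag rewriting.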
After three iterations on the design and code, this feature has come a long way in how flexibly it finds content in an existing website and gets it in place within the new site. It can be configured with a user id and password to log into a site, and it keeps track of the cookies as it goes to various pages on the site.
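As a rough illustration of logging in and carrying cookies from page to page, here is a small Python sketch using only the standard library. It is an assumption about how such a session could work, not the SCS implementation; the form field names `username` and `password` are hypothetical and differ from site to site.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Create an opener that stores cookies and sends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, user, password,
                        user_field="username", pass_field="password"):
    """Build a POST request for a login form (field names are assumptions)."""
    form = urllib.parse.urlencode({user_field: user, pass_field: password})
    return urllib.request.Request(login_url, data=form.encode("utf-8"))
```

Calling `opener.open(build_login_request(...))` once and then using the same `opener` for every later page lets the cookie jar replay the session cookie automatically.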
The feature is designed to look for lists of objects on a page and then follow links to get the details of each object before it inserts the object into the SCS database. For example, it can find a list of blog entries on a page, follow the link to the full post for each entry, and then create an article in the SCS database for each of those entries. If the blog entries have images, those get downloaded and placed into the SCS media library.
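The list-then-detail flow described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: `harvest` and its parameters are hypothetical names, `fetch` is any function that returns the HTML for a URL, and the result is a list of plain dictionaries rather than rows in the SCS database.

```python
import re
from urllib.parse import urljoin


def harvest(list_url, fetch, link_re, title_re, body_re):
    """Find detail links on a list page, fetch each one, and extract fields."""
    list_html = fetch(list_url)
    articles = []
    for href in re.findall(link_re, list_html):
        # Links on the list page may be relative to the list page itself.
        detail_url = urljoin(list_url, href)
        detail_html = fetch(detail_url)
        title = re.search(title_re, detail_html, re.S)
        body = re.search(body_re, detail_html, re.S)
        if title and body:  # skip pages that do not match the patterns
            articles.append({
                "url": detail_url,
                "title": title.group(1).strip(),
                "body": body.group(1).strip(),
            })
    return articles
```

Passing `fetch` in as a parameter keeps the sketch easy to try against saved pages before pointing it at a live site.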
Each site that is going to be processed like this needs its own XML configuration that defines the site URL, the user id and password to use, the regular expressions to match against the HTML to find the desired elements of data, and a few other options. You need a good grasp of regular expressions if you want to write a configuration.
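To give a feel for what such a configuration might contain, here is a Python sketch that parses a made-up XML document. The element and attribute names (`site`, `login`, `pattern`, `option`) are invented for this example and are not the real SCS schema; see the documentation for the actual format.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical configuration; NOT the real SCS schema.
# CDATA keeps the regular expressions readable without escaping "<".
SAMPLE_CONFIG = """\
<site url="http://old.example.com/blog/">
  <login user="importer" password="secret" />
  <pattern name="link"><![CDATA[<a class="post" href="([^"]+)"]]></pattern>
  <pattern name="title"><![CDATA[<h1>(.*?)</h1>]]></pattern>
  <option name="parallel" value="true" />
</site>
"""


def load_config(xml_text):
    """Parse the hypothetical config into a dict with compiled patterns."""
    root = ET.fromstring(xml_text)
    login = root.find("login")
    return {
        "url": root.get("url"),
        "user": login.get("user") if login is not None else None,
        "password": login.get("password") if login is not None else None,
        "patterns": {p.get("name"): re.compile(p.text, re.S)
                     for p in root.findall("pattern")},
        "options": {o.get("name"): o.get("value")
                    for o in root.findall("option")},
    }
```

Compiling the patterns while loading means a typo in a regular expression shows up immediately instead of halfway through a long import.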
When the code is grabbing pages from a site, it can either work with one page at a time in a single site session, or the parallel processing option can be turned on to let it grab as many web pages from the site at the same time as it can. This can make a big difference in the time it takes to grab all the desired content. As an example, a news site with 97 different news stories, a small percentage of them with images, took 27 seconds to grab completely when run in parallel, but took a minute and 30 seconds when run sequentially.
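The sequential versus parallel choice can be sketched with Python's standard thread pool. Again, this is an illustration and not the SCS code: `fetch_all`, the `parallel` flag, and the default of 8 workers are our own assumptions, and `fetch` is any function that returns the content for a URL.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, parallel=True, max_workers=8):
    """Fetch every URL, one at a time or concurrently in a thread pool.

    Results come back in the same order as urls in both modes.
    """
    if not parallel:
        return [fetch(url) for url in urls]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order even though requests overlap in time.
        return list(pool.map(fetch, urls))
```

Threads suit this job because most of the time is spent waiting on the network, so overlapping the requests shortens the total run without changing what gets downloaded.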
If you'd like to get the code for this, it comes with Syrinx Community Server. You can also read the developer documentation on the topic at:
syrinx.ph/Developer.aspx#gArt1_311FE53F-8587-4637-B6BA-15BE177FF2A6