Sitemap for Images

Greetings fellow C5 users,
I was wondering whether there's a built-in job, or an add-on, that makes C5 generate a sitemap for images. I looked and didn't find anything...

If not, I guess I'll need to make a custom job...

slafleche
 
TMDesigns replied (Best Answer):
Hi,

Given the way C5 deals with images, I don't think this would be possible. C5 tends to use caching for most of its images rather than straight paths.

I might be wrong but I imagine there is nothing made for this.

Tim
mhawke replied:
I agree. The file storage in C5 is seemingly random (it's not random, but it seems that way). Files are stored in

[root]/files/xxxx/xxxx/xxxx/yourimage.jpg

In order to 'sitemap' them, you would need to pull every page in the site and scrape the URLs from the <img> tags on the pages.
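
As a rough illustration of that scraping step, here is a minimal PHP sketch that pulls the src of every <img> tag out of a page's HTML. It assumes you already have the rendered HTML as a string (fetching each page is left out), and extractImageUrls() is a hypothetical helper name, not a concrete5 function.

```php
<?php
// Sketch only: collect the src attribute of every <img> in an HTML string.
// extractImageUrls() is a hypothetical helper, not part of concrete5.
function extractImageUrls(string $html): array
{
    if (trim($html) === '') {
        return []; // DOMDocument::loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src !== '') {
            $urls[] = $src;
        }
    }

    // The same image often appears more than once on a page.
    return array_values(array_unique($urls));
}
```

You would call this once per page and keep the result keyed by the page URL, so each image stays associated with the page it appears on.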
slafleche replied:
Oh yeah, I didn't think about that... Thanks!
jvansanten replied:
If you have access to the command line and simply want a list of images and their locations, this is the most natural way to get one:

tree -if --noreport directory/
mhawke replied:
That will literally pour a ton of useless file names into your command window. C5 saves a new image for every version of a page. If you have 20 pages with 2 images per page and 20 versions of each page, you're going to get 800 images listed when only 40 are actually 'active'.

I might be wrong on this but I believe that's how it works.
slafleche replied:
Thanks for the suggestion jvansanten. That could work for a developer, but most of our clients aren't tech savvy enough to use the command line. I was looking for a solution that could be automated.
jvansanten replied:
This would at least get the data, if not the precise presentation.

You could call a bash script from PHP using exec() and put this in a block. Something like the following:

// Example paths; getimages.sh is assumed to write the image list to the given file.
$script = "/util/getimages.sh";
$output_file = "/util/images/image_list.txt";
exec(escapeshellarg($script) . " " . escapeshellarg($output_file));

// Send the generated list to the browser.
readfile($output_file);

Configure the MIME type (or send a Content-Disposition: attachment header) so that the browser downloads the txt file rather than opening it in a window.
Job replied:
That would create a dependency, as most hosts don't allow access to exec() for security reasons. So all sites would have to be on a VPS that is configured to allow exec().
jvansanten replied:
Alternatively, if you need metadata, you could query the table that holds images and return the desired values.

You could create a block with a dashboard interface to display.
mhawke replied:
Which table contains this info?
JohntheFish replied:
Remo did something very close to this for checking links from pages as an example in his first book.
jshannon replied:
For the record, I think we're all talking about a few different things here. And it's made more difficult by the fact that I'm not really clear what the requirements are.

Yes, c5 does store everything in /files/xxxx/yyyy/zzzz/some-file.ext. And that includes not just images, but zip files, etc. Also, duplicates for new versions, and a bunch of .html files to keep the directories secure.

But the caching is in straight paths, at /files/cache/MD5STRING.jpg. You also have the auto-generated thumbnails in another directory (with the /xxxx/yyyy... nomenclature).

You can really easily get the current version of files using the FileList API, filter for the appropriate file types, and go to town. This queries the Files and FileVersions tables, along with all the appropriate attributes. It'd be quite easy... maybe 15 lines of API-y code.

But this won't tell you which images are "active", or what version is displayed on your front-end (i.e., you might have a 2000x2000 pixel image, but the front-end only displays a 100x100 version).

To do that, you'd need to scrape your own site, I think. Doing this programmatically is both really easy and really hard. Look at the page index job. It loops through all pages and then loops through their blocks to get searchable content. You'd have to do something similar, though unlike the getIndexableContent() call, you'd have to use the normal view, and then parse that for links to images. And filter out images that you don't find relevant (e.g., images that are part of a theme). Also, you lose any context to the original image (assuming that it's a cached image).
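
One possible way to do the filtering step, sketched under assumptions: treat a scraped image as content only if its URL path starts with /files/ (the storage and cache location described in this thread). isContentImage() and filterContentImages() are hypothetical helpers, and the theme path used in the usage note is made up for illustration.

```php
<?php
// Sketch only: decide whether a scraped image URL points into /files/
// (file manager storage or /files/cache/) rather than a theme or elsewhere.
function isContentImage(string $url, string $prefix = '/files/'): bool
{
    // parse_url() returns just the path, so relative and absolute URLs both work.
    $path = parse_url($url, PHP_URL_PATH);
    return is_string($path) && strncmp($path, $prefix, strlen($prefix)) === 0;
}

// Keep only the content images from a list of scraped URLs.
function filterContentImages(array $urls): array
{
    return array_values(array_filter($urls, 'isContentImage'));
}
```

For example, filterContentImages() would keep /files/cache/b0242afe81f160dfca7e8a46d4f9b78f_f121.jpg and drop something like /themes/example/images/logo.png. Note that this only looks at the path, so an image on another host that happens to live under /files/ would also pass.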

But I've never heard of an image sitemap in the same context as a normal sitemap, so I'm not sure what the goal is here. Maybe all you need is "all active images"....
mhawke replied:
Thanks jshannon for expanding on this. I think the original point I was trying to make in my first response is that this needs to be a top-down approach (scraping) versus a file system approach because of how C5 stores the images and the versioning/thumbnail issues you mention.
slafleche replied:
Hi jshannon, thanks for your post!
Google does use an image sitemap when available. Here's documentation from Google: http://support.google.com/webmasters/bin/answer.py?hl=en&answer...

Google even accepts video sitemaps, code sitemaps, news site maps!

Google also talks about this in their "Search Engine Optimization Starter Guide", page 19
http://static.googleusercontent.com/external_content/untrusted_dlcp...

Having these alternative image sitemaps might not be relevant to all websites, but for some it could be a plus. Some sites get good traffic via Google Image Search.

Having an automated job that generates an image sitemap seems like an easy way to boost SEO.
JohntheFish replied:
I have been thinking about this and could maybe solve it with some enhancements to Magic Heading.

It already scrapes text into meta data, so could do the same for images into an attribute or database table. It would then be a case of a dashboard interface or job to parse the attributes or table and create the image site map.
slafleche replied:
If we only think from Google's point of view, does anybody know how the cache will affect Google's results when Google crawls the site? My guess is that having an "ugly" path from the Files folder and/or the cache folder isn't great, but if the image has descriptive alt text, maybe it's not that big of a deal? I'm new to SEO, so I don't really know how much this matters.
jshannon replied:
Hmm...

So I looked at the Google specs and it seems that, at least ideally, you're including the "important" images within any given page, rather than just a dump. So you definitely need some scraping of the front-end, as mhawke has said.

But this wouldn't be a file so much as a modification of the current page sitemap. Which wouldn't be any more difficult than creating the image sitemap in the first place....
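
To make the "modification of the current page sitemap" idea concrete, here is a minimal sketch that writes a urlset where each page entry also lists its images, using the image namespace from Google's image sitemap extension. buildImageSitemap() and the shape of its input (page URL mapped to a list of absolute image URLs) are assumptions for illustration; it is not how the concrete5 sitemap job is implemented.

```php
<?php
// Sketch only: build sitemap XML with <image:image> entries nested in each <url>.
// $pages maps an absolute page URL to a list of absolute image URLs.
function buildImageSitemap(array $pages): string
{
    $xml = new XMLWriter();
    $xml->openMemory();
    $xml->setIndent(true);
    $xml->startDocument('1.0', 'UTF-8');

    $xml->startElement('urlset');
    $xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    $xml->writeAttribute('xmlns:image', 'http://www.google.com/schemas/sitemap-image/1.1');

    foreach ($pages as $pageUrl => $imageUrls) {
        $xml->startElement('url');
        $xml->writeElement('loc', $pageUrl);
        foreach ($imageUrls as $imageUrl) {
            $xml->startElement('image:image');
            // writeElement() escapes characters such as & in query strings.
            $xml->writeElement('image:loc', $imageUrl);
            $xml->endElement(); // image:image
        }
        $xml->endElement(); // url
    }

    $xml->endElement(); // urlset
    $xml->endDocument();

    return $xml->outputMemory();
}
```

The input could come from whatever scraping step you settle on. Optional tags such as a caption or title would go inside the same image:image element.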

One thing the Google spec does mention is that it only "helps" their engine figure out the correct images to index. Much like the sitemap. Personally, I never bother with a sitemap.

I would argue that there'd be a lot more value in tweaking the way that images are stored (or, at least, referenced) than in doing a sitemap. I looked at a site I have, and it appears that the content block uses the cached image but sets the image's alt attribute to the file name, while the image block uses the original file (with the filename at the end of the path). But if Google "weights" images similarly to pages (no guarantee), then the filename matters, and the directory depth is a negative....

After that, you'd want to loop through the content / image / other blocks and pick the images out. You might want to pull out the image ID (rather than the final path) and then use that to get the full title, description, alt text, etc. for the sitemap. You could also check for an "is_important" attribute, so you don't dump all images into the sitemap. This is probably a lot more valuable than just a listing of every image.
Steevb replied:
Just my two pence worth...

..you want Google to find you with image search?

Is this what the OP is all about?

C5 does a great job of being Google friendly.

Example: search for '9x5 roulette table' and choose images.

Over 50% of images (roughly the first 150) are my client's images (blackdoggames.co.uk).

Maybe a funny file name, but works fine, with links to images and websites.
jshannon replied:
I think c5 does a fine job, but probably not a great job. And, unfortunately, the name of the game in SEO is to eke out a bit more "juice" than the other guy.

For example, a search for '9 x 5 roulette table' or "9'x5' roulette table" or "5x9 roulette table" drops your client down to ~ 3 in the first 6 rows. I don't even see your client come up for "roulette table".

Arguably, having the images able to be better indexed, with more meta data, and "stronger" relevance would provide better results.
jasteele12 replied:
No offense, but that's just Google pulling the info from your content.

The first 9x5 roulette table result there actually looks like this (not even an alt tag):

<img width="480" height="320" src="/files/cache/b0242afe81f160dfca7e8a46d4f9b78f_f121.jpg"></img>

If you change the query to 9'x5' roulette table or 9 x 5 ft roulette table - that site completely disappears.

So unfortunately concrete5 breaks Google's very first image publishing guideline [ http://support.google.com/webmasters/bin/answer.py?hl=en&answer... ]: give your images detailed, informative filenames...

Right after you upload a file with a descriptive filename, c5 munges it.

(Looks like James was thinking along the same lines while I was doing some testing - knew I should have posted before that :)
Steevb replied:
No offense taken, but there are many variables.

This was just a quick indication on the subject.

Although "9'x5' roulette table" may not produce the preferred result, "9 x 5 ft roulette table" does, for us.

I'm not sure 'Joe' is going to bother to type "9'x5'", and most of the time Google will help with the end result anyway.

The point of my input was that C5 does a pretty good job.

Just give your image a reasonable resolution without being too heavy, a relevant name (matching the actual image), a good description and title.
jasteele12 replied:
Yeah, you pretty much have to do Google searches from a cookie-less browser, and be aware of local-biased results. That site didn't show anywhere near the top for those searches from here (Oregon, USA).

My point was that none of those criteria (other than maybe size) are applied to that site's images. c5 munges the filename in the cache, and you have none of those attributes above, only width and height.

Just as an FYI for others, this image file naming happens even if you have all caching disabled.

The sitemap allows for caption, title and geo_location among others. The fact that both Drupal and WordPress have plugins is probably a good clue also.

A note for those who manually clear their cache directory: If you remove the image files, they are not recreated until the blocks that contain them are rendered. That means any request for the image before that gets a 404 not found error. So don't do that :)