<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Comparing 160-bit hash values with SSE</title>
	<atom:link href="http://skitten.org/blog/2009/09/13/comparing-with-sse/feed/" rel="self" type="application/rss+xml" />
	<link>http://skitten.org/blog/2009/09/13/comparing-with-sse/</link>
	<description>Blog for stuff that isn't food, drink or not having a car</description>
	<lastBuildDate>Fri, 23 Jul 2010 20:21:51 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
	<item>
		<title>By: onitake</title>
		<link>http://skitten.org/blog/2009/09/13/comparing-with-sse/comment-page-1/#comment-1429</link>
		<dc:creator>onitake</dc:creator>
		<pubDate>Mon, 21 Dec 2009 22:16:20 +0000</pubDate>
		<guid isPermaLink="false">http://skitten.org/blog/?p=37#comment-1429</guid>
		<description>thank you for the post.
i was optimizing a small vector math library i wrote and had some trouble with the result of cmpeqps. what is it good for if i have to transfer a whole xmm register into memory to evaluate the result?
movmskps was exactly what i was looking for.

in my case, the work wasn&#039;t futile: my hand-optimized dot product runs about 10% faster than the mulss/addss version generated by gcc 4.2&#039;s tree-ssa optimizer, while my cross product routine reduces benchmark runtime by 20%. not as much as i expected, but i&#039;m sure there&#039;s still room for improvement.
i suspect there are still some unnecessary memory accesses. they hurt performance much more than the occasional serialization...</description>
		<content:encoded><![CDATA[<p>thank you for the post.<br />
i was optimizing a small vector math library i wrote and had some trouble with the result of cmpeqps. what is it good for if i have to transfer a whole xmm register into memory to evaluate the result?<br />
movmskps was exactly what i was looking for.</p>
<p>in my case, the work wasn&#8217;t futile: my hand-optimized dot product runs about 10% faster than the mulss/addss version generated by gcc 4.2&#8217;s tree-ssa optimizer, while my cross product routine reduces benchmark runtime by 20%. not as much as i expected, but i&#8217;m sure there&#8217;s still room for improvement.<br />
i suspect there are still some unnecessary memory accesses. they hurt performance much more than the occasional serialization&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joey</title>
		<link>http://skitten.org/blog/2009/09/13/comparing-with-sse/comment-page-1/#comment-1347</link>
		<dc:creator>Joey</dc:creator>
		<pubDate>Tue, 08 Dec 2009 02:44:53 +0000</pubDate>
		<guid isPermaLink="false">http://skitten.org/blog/?p=37#comment-1347</guid>
		<description>This is funny. Then sad. I shall continue to avoid assembler.</description>
		<content:encoded><![CDATA[<p>This is funny. Then sad. I shall continue to avoid assembler.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
